Abstract

We show that the two-stage adaptive Lasso procedure (Zou, 2006) is consistent for high-dimensional
model selection in linear and Gaussian graphical models. Our conditions for consistency cover more
general situations than those accomplished in previous work: we prove that restricted eigenvalue
conditions (Bickel et al., 2008) are also sufficient for sparse structure estimation.

1 Introduction 

The problem of inferring the sparsity pattern, i.e. model selection, in high-dimensional problems has
recently gained a lot of attention. One important stream of research, which we also adopt here, requires
computational feasibility and provable statistical properties of estimation methods or algorithms.
Regularization with $\ell_1$-type penalization has become extremely popular for model selection in
high-dimensional scenarios. The methods are easy to use, due to recent progress in convex optimization
(Meier et al., 2008; Friedman et al., 2008a), and they are asymptotically consistent or oracle optimal
under some conditions, e.g. on the design matrix in a linear model or among the variables in a graphical
model (Greenshtein and Ritov, 2004; Meinshausen and Bühlmann, 2006; van de Geer, 2008; Bickel et al., 2008).
However, these conditions, referred to as coherence or compatibility conditions, are often very restrictive.
The restrictions are due to severe bias problems with $\ell_1$-penalization, which also shrinks the
estimates that correspond to true signal variables; see also Zou (2006), Meinshausen (2007).

Regularization with the $\ell_q$-norm with $q < 1$ would mitigate some of the bias problems but becomes
computationally infeasible as the penalty is non-convex. As an interesting alternative, one can consider
multi-step procedures where each of the steps involves a convex optimization only. A prime example is the
adaptive Lasso (Zou, 2006), which is a two-step algorithm and whose repeated application corresponds in some
"loose" sense to a non-convex penalization scheme (Zou and Li, 2008). In this paper we analyze the
adaptive Lasso procedure for variable selection in linear models as well as for Gaussian graphical modeling.
Both frameworks are related to each other and for both of them, we derive results for model selection under
rather weak conditions. In particular, our results imply that the adaptive Lasso can recover the true
underlying model in situations where plain $\ell_1$-regularization fails (assuming restricted eigenvalue
conditions).

1.1 Variable selection in linear models 

Consider the linear model 

$$Y = X\beta + \epsilon, \tag{1.1}$$

where $X$ is an $n \times p$ design matrix, $Y$ is an $n \times 1$ vector of noisy observations and
$\epsilon$ is the noise term. The design matrix is treated as either fixed or random. We assume throughout
this paper that $p > n$ (i.e. high-dimensional) and $\epsilon \sim N(0, \sigma^2 I_n)$.

The sparse object to recover is the unknown parameter $\beta \in \mathbb{R}^p$. We assume that it has a
relatively small number $s$ of nonzero coefficients: $S := \mathrm{supp}(\beta) = \{j : \beta_j \neq 0\}$
and $s = |\mathrm{supp}(\beta)|$. Let $\beta_{\min} := \min_{j \in S} |\beta_j|$. Inferring the sparsity
pattern, i.e. variable selection, refers to the task of correctly estimating the support set
$\mathrm{supp}(\beta)$ based on noisy observations from (1.1). In particular, given some estimator
$\hat\beta$, recovery of the relevant variables is understood to be

$$\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta) \ \text{ with high probability.} \tag{1.2}$$

Regularized estimation with the $\ell_1$-norm penalty, also known as the Lasso (Tibshirani, 1996), refers
to the following convex optimization problem:

$$\hat\beta = \arg\min_{\beta} \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda_n\|\beta\|_1, \tag{1.3}$$

where the scaling factor $1/(2n)$ is chosen for convenience and $\lambda_n \geq 0$ is a penalization
parameter. The Lasso is an attractive and computationally tractable method with provably good statistical
properties, even if $p$ is much larger than $n$, for prediction (Greenshtein and Ritov, 2004), for
estimation in terms of the $\ell_1$- or $\ell_2$-loss (van de Geer, 2008; Meinshausen and Yu, 2009;
Bickel et al., 2008) and for variable selection (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006;
Wainwright, 2008). For the specific problem of variable selection, it is known that the so-called
"neighborhood stability condition" for the design matrix (Meinshausen and Bühlmann, 2006), which has been
re-formulated in a nicer form as the "irrepresentable condition" (Zhao and Yu, 2006), is necessary and
sufficient for consistent variable selection in the sense of (1.2). Moreover, as this condition is
restrictive, its necessity implies that the Lasso only works in a rather restricted range of problems,
excluding cases where the design exhibits too strong (empirical) correlations. A key motivation of our
work is to continue the exploration of a computationally tractable algorithm for variable selection, while
aiming to relax the stringent conditions that are imposed on the design matrix $X$.

Towards these goals, we analyze the adaptive Lasso procedure, see (2.2) below, for variable selection in
the high-dimensional setting. This method was originally proposed by Zou (2006), who analyzed the case
when $p$ is fixed. Further progress in analyzing the adaptive Lasso in the high-dimensional scenario has
been achieved by Huang et al. (2008). A more complete understanding of its power when applied to the
high-dimensional setting where $p \gg n$ is still lacking. We prove in this paper that variable selection
with the adaptive Lasso is possible under rather general incoherence conditions on the design. We do not
require more stringent conditions on the design $X$ than Bickel et al. (2008), who give the currently
weakest conditions for convergence of the Lasso in terms of $\|\hat\beta - \beta\|_1$ and
$\|\hat\beta - \beta\|_2$. We show that for an initial estimator $\beta_{\text{init}}$ in the two-stage
adaptive Lasso procedure with a sufficiently reasonable behavior of $\|\beta_{\text{init}} - \beta\|_\infty$,
model selection is possible assuming only a lower bound on the smallest eigenvalue of $X_S^T X_S/n$, where
$X_S$ denotes the submatrix of $X$ whose columns are indexed by $S$, and some restrictions on
$\beta_{\min}$ and the sparsity level $s$. Thus, variable selection is possible under rather general design
conditions by the two-stage adaptive Lasso, and it is necessary to move away from plain
$\ell_1$-regularization, see Meinshausen and Bühlmann (2006), Zhao and Yu (2006).

1.2 Covariance selection in Gaussian graphical models 

Covariance selection in a Gaussian graphical model refers to the problem of inferring conditional
independencies between a set of jointly Gaussian random variables

$$X_1, \ldots, X_p \sim N(0, \Sigma) \tag{1.4}$$

(the restriction to mean zero is without loss of generality). These variables $X_1, \ldots, X_p$
correspond to nodes in a graph, labeled by $\{1, \ldots, p\}$, and a Gaussian conditional independence
graph is then defined as follows:

there is an undirected edge between node $i$ and $j$ $\Longleftrightarrow$ $\Sigma^{-1}_{ij} \neq 0$.

The definition of an edge is equivalent to requiring that $X_i$ and $X_j$ are conditionally dependent
given all remaining variables $\{X_k;\ k \neq i, j\}$. For details cf. Lauritzen (1996). Estimation of the
edge set is thus equivalent to finding the zeroes in the concentration matrix $\Sigma^{-1}$.
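As a small numerical illustration of this equivalence (a sketch, not taken from the paper), consider a hypothetical four-node chain graph. We specify a tridiagonal concentration matrix, invert it to obtain the covariance matrix, and read the edge set back off the nonzero off-diagonal entries of the inverse. The matrix values, the function name and the tolerance `tol` are assumptions chosen for illustration.

```python
import numpy as np

def edges_from_covariance(sigma, tol=1e-10):
    """Return the edges (i, j), i < j, where the concentration matrix is nonzero."""
    theta = np.linalg.inv(sigma)  # concentration (precision) matrix
    p = theta.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(theta[i, j]) > tol}

# Hypothetical chain graph 0 - 1 - 2 - 3: tridiagonal, positive definite concentration matrix.
theta_true = np.array([[ 2.0, -1.0,  0.0,  0.0],
                       [-1.0,  2.0, -1.0,  0.0],
                       [ 0.0, -1.0,  2.0, -1.0],
                       [ 0.0,  0.0, -1.0,  2.0]])
sigma = np.linalg.inv(theta_true)      # the covariance matrix itself is dense
edges = edges_from_covariance(sigma)   # only the three chain edges remain
```

Note that the covariance matrix has no zero entries here: marginal dependence between non-adjacent nodes is compatible with conditional independence, which is why the edge set is read from the concentration matrix.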

In the high-dimensional scenario with $p > n$, where $n$ denotes the sample size of i.i.d. copies from
(1.4), $\ell_1$-type regularization has been analyzed.

Meinshausen and Bühlmann (2006) prove that it is possible to consistently infer the edge set by
considering many variable selection problems in high-dimensional Gaussian regressions, again requiring a
global neighborhood stability or irrepresentable condition which puts some restrictions on the covariance
matrix $\Sigma$. Later, the GLasso penalization was proposed (Friedman et al., 2008b; Banerjee et al., 2008),
which is a sparse estimator for $\Sigma^{-1}$ using an $\ell_1$-penalty on the non-diagonal elements of
$\Sigma^{-1}$ in the multivariate Gaussian log-likelihood. Ravikumar et al. (2008) recently obtained
results for consistent covariance selection (i.e. inferring the edge set) using the GLasso by imposing
mutual incoherence conditions (analogous to the neighborhood stability condition) on the Fisher
information matrix (of size $p^2 \times p^2$) of the model, which is an edge-based counterpart of $\Sigma$.

We focus here on generalizing conditions for the pursuit via many regressions: by analyzing the pursuit of
many regressions with the adaptive Lasso, we prove in this paper a result for inferring the edge set in a
Gaussian graphical model under a rather general condition on $\Sigma$, closely related to the restricted
eigenvalue assumptions in Bickel et al. (2008). We conjecture that the set of conditions which we are
imposing is more general than what Ravikumar et al. (2008) require when using the GLasso, although this is
a point that needs to be thoroughly studied, as we discuss further in Section 7. We also suspect that the
GLasso approach is intrinsically more limited, in terms of restrictions for the covariance matrix $\Sigma$,
than the approach from Meinshausen and Bühlmann (2006) via considering many regressions. This has been
recognized by Meinshausen (2008) and also studied by Ravikumar et al. (2008) on specific graphical models.
On the other hand, for well-behaved problems, GLasso might have an advantage because it exploits the
positive definiteness of $\Sigma$ and
1.3 Related work



Recently, Huang et al. (2008) studied the adaptive Lasso estimators in sparse, high-dimensional linear
regression models for a fixed design. Under a rather strong mutual incoherence condition between every pair
of relevant and irrelevant covariates and assuming other regularity conditions, they prove that the
adaptive Lasso recovers the correct model and has an oracle property. While they have derived the same
incoherence condition as one (among others) of ours in (8.4a) in order for the second stage weighted Lasso
procedure to achieve model selection consistency, they achieve it by an initial estimator assuming some
strong mutual incoherence condition which bounds the pairwise correlations of the columns of the design.
This is a much stronger condition than the restricted eigenvalue assumptions that we make, see
Bickel et al. (2008).

Meinshausen and Yu (2009) examined the variable selection property of the Lasso followed by a
thresholding procedure. Under a relaxed "incoherence design" assumption, Meinshausen and Yu (2009) show
that the estimator is still consistent in the $\ell_2$-norm sense for fixed designs, and furthermore, it is
possible to do hard-thresholding on the ordinary Lasso estimator to achieve variable selection consistency.
However, the choice of the threshold parameter depends on the unknown value $\beta_{\min}$ and the
sparsity $s$ of $\beta$. It is not clear how one can choose such a threshold parameter without knowing
$\beta_{\min}$ or $s$. A more general framework for multi-stage variable selection was studied by
Wasserman and Roeder (2008) for various methods and conditions. Their approach controls the probability of
false positives (i.e. type I error) but pays a price in terms of false negatives (i.e. type II error) in
comparison to the adaptive Lasso (Wasserman and Roeder, 2008).

Finally, our focus is rather different from that of Wainwright (2008, 2007), where the goal was to analyze
the least number of samples that one needs in order to recover a sparse signal via a random or a fixed
measurement ensemble that satisfies strong incoherence conditions. It is an open problem to establish a
lower bound on the sample size, given $p$, $s$ and $\beta_{\min}$, to recover the model with the adaptive
Lasso, assuming restricted eigenvalue assumptions only.

1.4 Organization of the paper 

In Section 2 we define the two-step adaptive Lasso procedure for linear regression and describe our main
result: general model selection properties of the second stage weighted procedure for variable selection.
Here, the initial estimator $\beta_{\text{init}}$ can be general, and we assume a bound for
$\|\beta_{\text{init}} - \beta\|_\infty$. Section 3 presents the restricted eigenvalue conditions we need
for deriving bounds for $\|\beta_{\text{init}} - \beta\|_\infty$ with the standard Lasso as initial
estimator $\beta_{\text{init}}$. In Sections 4, 5 and 6, we summarize conditions and results, with the
standard Lasso as initial estimator, for linear regression with fixed design, linear regression with
random design, and Gaussian graphical modeling, respectively. These results are consequences of our
general result in Section 2. Section 8 presents a model selection lemma for the weighted Lasso with
general weights. The remainder of the paper contains the proofs.

2 The adaptive Lasso estimator and its general properties 

Consider the linear model in (1.1). We distinguish later between fixed and random design.
2.1 The two-stage adaptive Lasso procedure



The adaptive Lasso is the Lasso estimator with a re-weighted penalty function, see (2.2) below. The weights
are estimated from an initial estimator $\beta_{\text{init}}$:

$$w_j := \max\left\{\frac{1}{|\beta_{j,\text{init}}|},\ 1\right\}. \tag{2.1}$$

We note that the original proposal of Zou (2006) uses $w_j = 1/|\beta_{j,\text{init}}|^{\gamma}$ for some
$\gamma > 0$, with $\gamma = 1$ the most common choice. The adaptive Lasso is now defined by a second-stage
weighted Lasso:

$$\hat\beta = \arg\min_{\beta} \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda_n \sum_{j=1}^{p} w_j |\beta_j|. \tag{2.2}$$
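The two stages can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: it assumes a plain cyclic coordinate descent solver, takes the weights as $\max\{1/|\beta_{j,\text{init}}|, 1\}$, and treats a zero initial coefficient as an infinite weight, so that the variable is excluded in the second stage. The function names and the iteration count `n_iter` are hypothetical.

```python
import numpy as np

def weighted_lasso_cd(X, y, lam, w, n_iter=500):
    """Cyclic coordinate descent for
    (1/(2n)) * ||y - X b||_2^2 + lam * sum_j w[j] * |b[j]|.
    An infinite weight w[j] forces b[j] = 0."""
    n, p = X.shape
    b = np.zeros(p)
    r = np.asarray(y, dtype=float).copy()   # residual y - X b
    col_sq = (X ** 2).sum(axis=0) / n       # (1/n) * ||X_j||_2^2
    free = np.isfinite(w) & (col_sq > 0)    # coordinates allowed to be nonzero
    for _ in range(n_iter):
        for j in np.flatnonzero(free):
            r += X[:, j] * b[j]             # partial residual without coordinate j
            rho = X[:, j] @ r / n
            # soft-thresholding update for coordinate j
            b[j] = np.sign(rho) * max(abs(rho) - lam * w[j], 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

def adaptive_lasso(X, y, lam_init, lam_n):
    """Stage 1: plain Lasso. Stage 2: weighted Lasso with data-driven weights."""
    p = X.shape[1]
    b_init = weighted_lasso_cd(X, y, lam_init, np.ones(p))
    w = np.full(p, np.inf)                  # 1/|0| is treated as an infinite weight
    nz = b_init != 0
    w[nz] = np.maximum(1.0 / np.abs(b_init[nz]), 1.0)
    return weighted_lasso_cd(X, y, lam_n, w), b_init
```

A call such as `b_hat, b_init = adaptive_lasso(X, y, lam_init, lam_n)` returns both stages, so the supports of the initial and the adaptive estimates can be compared directly. The fixed iteration count is a simplification; a convergence check on the change in `b` would be the more careful choice.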



2.2 Variable selection with the adaptive Lasso estimator 

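Correct variable selection with the adaptive Lasso requires some conditions for the design. We first make
some assumptions related to the design matrix. For a symmetric matrix $A$, let $\Lambda_{\min}(A)$ denote
the smallest eigenvalue of $A$.

For a fixed design matrix $X$, we define

$$\Lambda_{\min}(s) := \min_{J_0 \subseteq \{1,\ldots,p\},\ |J_0| \leq s}\ \min_{\gamma \neq 0,\ \gamma_{J_0^c} = 0} \frac{\|X\gamma\|_2^2}{n\|\gamma_{J_0}\|_2^2}. \tag{2.3}$$

We assume throughout this paper that $\Lambda_{\min}(s) > 0$. As a consequence of this definition we have

$$\Lambda_{\min}\left(\frac{X_S^T X_S}{n}\right) \geq \Lambda_{\min}(s) > 0. \tag{2.4}$$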
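For small $p$, the quantity in (2.3) can be computed by brute force. The sketch below assumes the reading of (2.3) as the smallest eigenvalue of $X_J^T X_J/n$ minimized over index sets $J$ with $|J| \leq s$. Since the smallest eigenvalue of a principal submatrix can only decrease when the index set grows, it suffices to scan the sets of size exactly $\min(s, p)$. The function name is hypothetical, and the enumeration is exponential in $s$, so this is for illustration only.

```python
import itertools
import numpy as np

def lambda_min_s(X, s):
    """Brute-force s-sparse minimal eigenvalue of X^T X / n."""
    n, p = X.shape
    gram = X.T @ X / n
    best = np.inf
    for J in itertools.combinations(range(p), min(s, p)):
        idx = np.array(J)
        sub = gram[np.ix_(idx, idx)]                  # principal submatrix indexed by J
        best = min(best, np.linalg.eigvalsh(sub)[0])  # eigvalsh sorts eigenvalues ascending
    return best
```

Two identical columns give `lambda_min_s(X, 2)` equal to zero up to rounding, which is the kind of design that the assumption $\Lambda_{\min}(s) > 0$ excludes.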

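Furthermore, we assume for fixed design that the $\ell_2$-norm of each column of $X$ is upper bounded by
$c_0\sqrt{n}$ for some constant $c_0 > 0$. We then consider the set

$$\mathcal{T} := \left\{\epsilon : \left\|\frac{X^T\epsilon}{n}\right\|_\infty \leq c_0\sigma\sqrt{\frac{6\log p}{n}}\right\}. \tag{2.5}$$

The set $\mathcal{T}$ has large probability, as described below in (2.15).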
For a random design matrix $X$ we assume:

$$X \text{ has i.i.d. rows} \sim N(0, \Sigma), \tag{2.6}$$

where we assume without loss of generality that the mean is zero and $\Sigma_{jj} = 1$,
$\forall j = 1, \ldots, p$. We then define

$$\Lambda_{\min}(s) := \frac{16}{17}\min_{J_0 \subseteq \{1,\ldots,p\},\ |J_0| \leq s}\ \min_{\gamma \neq 0,\ \gamma_{J_0^c} = 0} \frac{\gamma^T\Sigma\gamma}{\|\gamma_{J_0}\|_2^2}. \tag{2.7}$$

As for fixed design, we assume that $\Lambda_{\min}(s) > 0$ with large probability. The factor 16/17 allows
us to use the same notation $\Lambda_{\min}(s)$ for both fixed and random design. Let $\Sigma_{SS}$ be the
sub-matrix of $\Sigma$ with rows and columns both indexed by the active set $S$. It then holds that

$$\Lambda_{\min}(\Sigma_{SS}) \geq \frac{17\Lambda_{\min}(s)}{16} > 0. \tag{2.8}$$

Then, a random design $X$ as in (2.6) behaves nicely, with high probability. To be more precise, denote by
$\Delta = X^T X/n - \Sigma$, and consider

$$\mathcal{X} := \left\{\max_{j,k}|\Delta_{jk}| < C_2\sqrt{\frac{\log p}{n}}\right\}, \tag{2.9}$$

for some constant $C_2 > 4\sqrt{5/3}$. Throughout this paper, we assume for random design that
$p < e^{n/(4C_2^2)}$, i.e. $C_2\sqrt{\log p/n} < 1/2$, such that $\mathcal{X}$ holds with probability at
least $1 - 1/p^2$ (cf. (2.16) and Lemma 9.3). We note that this implies that on $\mathcal{X}$,

$$\forall j = 1, \ldots, p, \quad \|X_j\|_2^2 \leq \frac{3n}{2}. \tag{2.10}$$

The set $\mathcal{T}$ in (2.5), intersected with $\mathcal{X}$, is also relevant for random design: the
constant $c_0$ equals $\sqrt{3/2}$, following (2.10).
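Whether a given design lies in the event of (2.9) can be checked numerically. The sketch below assumes that the event is read as $\max_{j,k}|\Delta_{jk}| < C_2\sqrt{\log p/n}$ with $\Delta = X^TX/n - \Sigma$; the function names are hypothetical and the population covariance `sigma` must be supplied by the user.

```python
import math
import numpy as np

def max_entry_deviation(X, sigma):
    """max_{j,k} |Delta_jk| with Delta = X^T X / n - Sigma."""
    n = X.shape[0]
    return np.abs(X.T @ X / n - sigma).max()

def in_event_X(X, sigma, C2):
    """True if the largest entrywise deviation is below C2 * sqrt(log(p) / n)."""
    n, p = X.shape
    return bool(max_entry_deviation(X, sigma) < C2 * math.sqrt(math.log(p) / n))
```

On this event the diagonal entries of $X^TX/n$ deviate from $\Sigma_{jj} = 1$ by less than $1/2$ whenever $C_2\sqrt{\log p/n} < 1/2$, which is the source of the column norm bound used for the constant $c_0$.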

For both fixed and random design, we consider the quantity

$$r_n(S) := \left\|X_{S^c}^T X_S (X_S^T X_S)^{-1}\right\|_\infty, \tag{2.11}$$

where $\|A\|_\infty = \max_{1 \leq i \leq k}\sum_{j=1}^{m}|A_{ij}|$ for a $k \times m$ matrix $A$. The
properties of the adaptive Lasso procedure depend on (an upper bound of) $r_n(S)$.
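The quantity $r_n(S)$ is straightforward to evaluate for a given design and support. The following sketch assumes the definition as the maximum absolute row sum of $X_{S^c}^T X_S (X_S^T X_S)^{-1}$ and that $X_S^T X_S$ is invertible; the function name is hypothetical.

```python
import numpy as np

def r_n(X, S):
    """Maximum absolute row sum of X_{S^c}^T X_S (X_S^T X_S)^{-1}."""
    p = X.shape[1]
    S = np.asarray(sorted(S), dtype=int)
    Sc = np.setdiff1d(np.arange(p), S)
    if Sc.size == 0:
        return 0.0
    XS, XSc = X[:, S], X[:, Sc]
    # The Gram matrix is symmetric, so solving (X_S^T X_S) M^T = X_S^T X_{S^c}
    # yields M = X_{S^c}^T X_S (X_S^T X_S)^{-1} without forming the inverse.
    M = np.linalg.solve(XS.T @ XS, XS.T @ XSc).T
    return float(np.abs(M).sum(axis=1).max())
```

Each row of `M` holds the least squares coefficients from regressing one irrelevant column on the relevant columns, so $r_n(S)$ is small when no irrelevant variable is well explained by the active set.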

Finally, we denote by

$$\delta := \beta_{\text{init}} - \beta$$

the difference between the initial estimate and the true parameter value.

Theorem 2.1. Consider the adaptive Lasso estimator in a linear model as in (1.1) with design $X$, where
$n < p$, and for fixed design: the $\ell_2$-norm of each column of $X$ is upper bounded by $c_0\sqrt{n}$
for some constant $c_0 > 0$.

Assume the upper bound $\tilde r_n \geq r_n(S)$, which we require to hold only on $\mathcal{X}$ in case of
a random design. Furthermore, assume on $\mathcal{T}$ for a fixed design and on
$\mathcal{X} \cap \mathcal{T}$ for a random design, some upper bounds on $\delta$ as follows:
$1 > \bar\delta_S \geq \|\delta_S\|_\infty$ and $1 > \bar\delta_{S^c} \geq \|\delta_{S^c}\|_\infty$.
Suppose that on $\mathcal{T}$ for a fixed design and on $\mathcal{X} \cap \mathcal{T}$ for a random design:

for some I > rj > and some constant M > |, A„ is chosen from the range 



Furthermore, assume:



Mcoa5s^J^-^^^^^^ > K > l£o^ /^M^^. (2.12) 



_ 1 — v 

Tn < — (2.13) 
/?„,in > max <^ 2ds, -T 7-7, -r t-ta/ — — , (^K^S^ } ■ (2-14)

Then, with probability $1 - P(\mathcal{T}^c) - 1/p^2$ for a fixed design or
$1 - P((\mathcal{X} \cap \mathcal{T})^c) - 1/p^2$ for a random design, respectively, the optimal solution
$\hat\beta$ to (2.2) satisfies $\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta)$.



A proof is given in Section 13. We furthermore argue below that the sets $\mathcal{T}$ and $\mathcal{X}$
(and hence also $\mathcal{T} \cap \mathcal{X}$) have large probability.

Remark 2.2. In general, there may be multiple solutions of the adaptive Lasso in (2.2). However, with high
probability, the solution of (2.2) is unique. This follows from Wainwright (2008) and we present more
details in Section 12.2.

Remark 2.3. The last term on the right hand side in (2.14) usually dominates all others (under the
assumptions we make for the theorem): the order of magnitude is typically $O(\sqrt{s\log(p)/n})$.
Furthermore, for a fixed design, we emphasize that the bounds $\tilde r_n$, $\bar\delta_S$ and
$\bar\delta_{S^c}$ are only required to hold on the set $\mathcal{T}$. Similarly, for a random design, we
only require some upper bounds to hold on the set $\mathcal{T} \cap \mathcal{X}$.

Remark 2.4. We note that Theorem 2.1 suggests that we can use any initial estimator that yields a nice
bound on $\|\delta\|_\infty = \|\beta_{\text{init}} - \beta\|_\infty$. We consider the Lasso as initial
estimator in Sections 4 and 5. The Dantzig selector (Candes and Tao, 2007) could be an alternative having
similar properties as the Lasso under the restricted eigenvalue assumptions (Bickel et al., 2008).

Lemma 2.5. For a fixed design, we have

$$P(\mathcal{T}) \geq 1 - 1/p^2. \tag{2.15}$$

Moreover, for a random design $X$ as in (2.6) with $\Sigma_{jj} = 1$, $\forall j \in \{1, \ldots, p\}$,
and for $p < e^{n/(4C_2^2)}$, where $C_2 > 4\sqrt{5/3}$, we have

$$P(\mathcal{X}) \geq 1 - 1/p^2. \tag{2.16}$$

Hence, for a random design,

$$P(\mathcal{X} \cap \mathcal{T}) \geq 1 - 2/p^2.$$

A proof is given in Section 9 (Lemmas 9.1 and 9.3).



3 Restricted eigenvalue conditions 

We are analyzing in later sections the properties of the adaptive Lasso when using the standard Lasso as
initial estimator:

$$\beta_{\text{init}} := \arg\min_{\beta} \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda_{\text{init}}\sum_{j=1}^{p}|\beta_j|, \tag{3.1}$$

where for some constants $B$ and $c_0$ to be specified,

$$\lambda_{\text{init}} = Bc_0\sigma\sqrt{\frac{\log p}{n}}. \tag{3.2}$$

As usual, in order to be a sensible procedure, we assume that the different variables (columns in $X$) are
on the same scale. In view of Theorem 2.1, we need to establish bounds for
$\delta = \beta_{\text{init}} - \beta$, where $\beta_{\text{init}}$ is defined in (3.1).

To derive such bounds for $\delta$, we build upon recent work by Bickel et al. (2008) under the
"restricted eigenvalue" assumptions formalized therein, which are weaker than those in
Candes and Tao (2007); Meinshausen and Yu (2009) for deriving $\ell_p$ bounds on $\delta$, where
$p = 1, 2$, for the Dantzig selector and the Lasso respectively. Similar conditions have been used by
Koltchinskii (2008) and van de Geer (2007).



3.1 Restricted eigenvalue assumption for fixed design 

To introduce the first assumption, we need some more notation. For integers $s, m$ such that
$1 \leq s \leq p/2$ and $m \geq s$, $s + m \leq p$, a vector $\delta \in \mathbb{R}^p$ and a set of
indices $J_0 \subseteq \{1, \ldots, p\}$ with $|J_0| \leq s$, denote by $J_m$ the subset of
$\{1, \ldots, p\}$ corresponding to the $m$ largest in absolute value coordinates of $\delta$ outside of
$J_0$, and define $J_{0m} = J_0 \cup J_m$.

Assumption 3.1. Restricted eigenvalue assumption $RE(s, m, k_0, X)$ (Bickel et al., 2008). Consider a
fixed design. For some integer $1 \leq s \leq p/2$, $m \geq s$, $s + m \leq p$, and a positive number
$k_0$, the following condition holds:

$$\frac{1}{K(s, m, k_0, X)} := \min_{J_0 \subseteq \{1,\ldots,p\},\ |J_0| \leq s}\ \min_{\gamma \neq 0,\ \|\gamma_{J_0^c}\|_1 \leq k_0\|\gamma_{J_0}\|_1} \frac{\|X\gamma\|_2}{\sqrt{n}\,\|\gamma_{J_{0m}}\|_2} > 0. \tag{3.3}$$

We often restrict ourselves to the case with $k_0 = 3$. Evidently, $RE(s, m, k_0, X)$ implies that
$RE(s, k_0, X)$ as in Definition 3.1 below holds with $K(s, k_0, X) \leq K(s, m, k_0, X)$ for the same $X$.

Definition 3.1. Restricted eigenvalue definition $RE(s, k_0, X)$ (Bickel et al., 2008). Consider a fixed
design. For some integer $1 \leq s \leq p$ and a positive number $k_0$, the following condition holds:

$$\frac{1}{K(s, k_0, X)} := \min_{J_0 \subseteq \{1,\ldots,p\},\ |J_0| \leq s}\ \min_{\gamma \neq 0,\ \|\gamma_{J_0^c}\|_1 \leq k_0\|\gamma_{J_0}\|_1} \frac{\|X\gamma\|_2}{\sqrt{n}\,\|\gamma_{J_0}\|_2} > 0. \tag{3.4}$$

We note that variable selection with the adaptive Lasso is possible under this weaker form of restricted
eigenvalues, though with stronger conditions on the sparsity $s$ and $\beta_{\min}$. We omit such results
in this paper due to the lack of space.

By an argument in Bickel et al. (2008), it is known that if $RE(s, k_0, X)$ is satisfied with
$k_0 \geq 1$, then the square submatrices of size $\leq 2s$ of $X^TX/n$ are necessarily positive definite.
In fact, it is clear that in (3.4), the set of admissible $\gamma$ is a superset of that in (2.3). Hence we
have the following:

Proposition 3.2. Suppose Assumption $RE(s, k_0, X)$ holds for $1 \leq s \leq p/2$ and some $k_0 > 0$. Then

$$\Lambda_{\min}(s) \geq \frac{1}{K^2(s, k_0, X)} > 0$$

for $\Lambda_{\min}(s)$ as defined in (2.3).

Note that the quantity $\Lambda_{\min}(s)$ also appears in Theorem 2.1 and hence when applying it, we make
use of Proposition 3.2.



3.2 Restricted orthogonality assumption for fixed design 



We also present results under a stronger design condition which covers cases where the sparsity $s$ is
allowed to be larger than in Corollary 4.4 under Assumption 3.1, see also Corollary 4.5. We define the
$(s, s')$-restricted orthogonality constant (Candes and Tao, 2007) $\theta_{s,s'}$ for $s + s' \leq p$,
which is the smallest quantity such that

$$\left|\frac{\langle X_T c,\ X_{T'} c'\rangle}{n}\right| \leq \theta_{s,s'}\|c\|_2\|c'\|_2 \tag{3.5}$$

holds for all disjoint sets $T, T' \subseteq \{1, \ldots, p\}$ of cardinality $|T| \leq s$ and
$|T'| \leq s'$.



Assumption 3.2.. Restricted orthogonality assumption. Consider a fixed design. For some integer 
1 < s < p/2, m > s, s + m < p, and a positive number ko, the condition RE{s, s,kQ, X) holds. 
Furthermore, the following condition holds: 



Amm(s) > 16koK'^{s,m,ko,X)Xinits9i^s, 
n 

s < 



96cla^K^{s,s,ko,X)logp' 



(3.6) 
(3.7) 



where ko < 3. 



With such a restriction on the sparsity, we note that (3.6) is a weaker condition than Assumption 3 in Bickel et al. 
(2008). We assume that (3.6) holds with a constant that is smaller than 2ko as in Assumption 3 of (Bickel et al., 
2008), which by itself is a sufficient condition to derive Assumption 3.1. 

We refer to Bickel et al. (2008) for more detailed discussions about these assumptions, which are weaker
than those in Candes and Tao (2007); Meinshausen and Yu (2009) and arguably less restrictive than those
in Meinshausen and Bühlmann (2006), Zhao and Yu (2006) or Wainwright (2008).

4 The adaptive Lasso with fixed design 

We first show that the restricted eigenvalue condition allows us to derive upper bounds on the
$\ell_\infty$-norms of

$$\delta := \beta_{\text{init}} - \beta.$$

Lemma 4.1. Suppose that condition $RE(s, 3, X)$ holds for a fixed design and suppose that

$$\beta_{\min} \geq 8K^2(s, 3, X)\lambda_{\text{init}}\sqrt{s}, \tag{4.1}$$

for $\lambda_{\text{init}}$ that satisfies (3.2). Then, the initial estimator (3.1) in model (1.1)
guarantees that on the set $\mathcal{T}$ as in (2.5),

$$\|\delta_S\|_\infty \leq 4K^2(s, 3, X)\lambda_{\text{init}}\sqrt{s}, \quad \text{and} \tag{4.2a}$$

$$\|\delta_{S^c}\|_\infty \leq 3K^2(s, 3, X)\lambda_{\text{init}}s. \tag{4.2b}$$

Suppose that Assumption $RE(s, s, 3, X)$ and (4.1) hold. Then on the set $\mathcal{T}$ as in (2.5), (4.2a)
holds, while (4.2b) is replaced by

$$\|\delta_{S^c}\|_\infty \leq 16K^2(s, s, 3, X)\lambda_{\text{init}}\sqrt{s}. \tag{4.3}$$


A proof is given in Subsection 10.2. Lemma 4.1 leads to the upper bounds
$\bar\delta_S = 4K^2(s, 3, X)\lambda_{\text{init}}\sqrt{s}$ and
$\bar\delta_{S^c} = 16K^2(s, s, 3, X)\lambda_{\text{init}}\sqrt{s}$. When using these bounds in
Theorem 2.1, we see that the range for the regularization parameter in (2.12) depends on the unknown
sparsity $s$. This unpleasant situation can be improved by estimating $s$ using a thresholding procedure
as follows.

Lemma 4.2. Thresholding procedure. Let the assumptions of Lemma 4.1 hold. Consider the set $\hat S$ of all
indices $j \in \{1, \ldots, p\}$ whose initial estimates $\beta_{j,\text{init}}$ have absolute values
larger than $4\lambda_{\text{init}}$. Let $\hat s := |\hat S|$ be an estimate which is of the same order as
the true sparsity $s$. More specifically, we have, on the set $\mathcal{T}$ in (2.5),

$$S \subseteq \hat S \quad \text{and} \quad s \leq |\hat S| \leq sK^2(s, 3, X) \ \text{ for } K > 2. \tag{4.4}$$

A proof of Lemma 4.2 is given in Subsection 10.3.
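The thresholding step of Lemma 4.2 amounts to a single comparison per coordinate. The sketch below is illustrative only: the function name is hypothetical, and the initial estimate and the value of the initial penalty are assumed to be supplied by the caller.

```python
import numpy as np

def threshold_support(beta_init, lam_init):
    """Indices j with |beta_init[j]| > 4 * lam_init, together with their count."""
    beta_init = np.asarray(beta_init, dtype=float)
    S_hat = np.flatnonzero(np.abs(beta_init) > 4.0 * lam_init)
    return S_hat, int(S_hat.size)
```

The returned count can then stand in for the unknown sparsity when choosing the second-stage penalty, which is the purpose of the lemma.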
The range 

-V/Amin(^) 



The range for the tuning parameter A is now specified as follows. For some constant '^■^(y'^o) < M < 



(l-77)coo- Y 21ogp 



n 



where < < 1, A„ is chosen such that 
16MK{s,s,ko) > ^ teJit. — ^ t> , (4-5) 



where $\lambda_{\text{init}}$ is defined in (3.2) with $B = \sqrt{24}$ and $c_0 \geq 1$ is a small
constant to be specified. The following theorem is an immediate result when we substitute
$\bar\delta_{S^c}$ and $\bar\delta_S$ that appear in Theorem 2.1 with what we derived in Lemma 4.1.

Theorem 4.3. (Variable selection for fixed design) Consider the linear model in (1.1) with fixed design
$X$, where $n < p$, and each column of $X$ has its $\ell_2$-norm upper bounded by $\sqrt{n}$. Suppose
condition $RE(s, s, 3, X)$ (Assumption 3.1) holds. Suppose on $\mathcal{T}$, for some $1 > \eta > 0$,
$\lambda_n$ is chosen as in (4.5) with $K(s, s, k_0) = K(s, s, 3, X)$ and $c_0 = 1$. Suppose $s$ satisfies
(3.7) and

/3min > max|^-^,— jieX^Ainit^ (4.7) 



where $\lambda_{\text{init}}$ is defined in (3.2) with $B = \sqrt{24}$ and $K = K(s, s, 3, X)$. Then, with
probability $1 - 2/p^2$, the adaptive estimator in (2.2) satisfies
$\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta)$.
A proof is given in Section 10.4. A first corollary follows immediately from Theorem 4.3 when we substitute
$\tilde r_n = \sqrt{s}/\sqrt{\Lambda_{\min}(s)}$, as shown in Lemma 10.3, formula (10.16) with $c_0 = 1$.

Corollary 4.4. (Variable selection for fixed design: general bound for $r_n$) Consider the linear model in
(1.1) with fixed design $X$, where $n < p$, and each column of $X$ has its $\ell_2$-norm upper bounded by
$\sqrt{n}$. Suppose that condition $RE(s, s, 3, X)$ (Assumption 3.1) holds. Suppose that on $\mathcal{T}$
and for some $1 > \eta > 0$, $\lambda_n$ is chosen as in (4.5) with $K(s, s, k_0) = K(s, s, 3, X)$ and
$c_0 = 1$,



/3mm > max J- p^^===,^\l6K'^\initV^ (4.9) 

I (l-7?)VAmm(s) V3 I 



where $\lambda_{\text{init}}$ is defined in (3.2) with $B = \sqrt{24}$ and $K = K(s, s, 3, X)$. Then, with
probability $1 - 2/p^2$, the adaptive estimator in (2.2) satisfies
$\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta)$.

Using the different bound $\tilde r_n$ from Lemma 10.3, formula (10.17), our next corollary shows that
under Assumption 3.2, we can essentially achieve the sublinear sparsity level of (3.7) while conducting
model selection.

Corollary 4.5. (Variable selection for fixed design: special bound for $r_n$) Consider the linear model in
(1.1) with fixed design $X$, where $n < p$, and each column of $X$ has $\ell_2$-norm upper bounded by
$\sqrt{n}$. Suppose that Assumption 3.2 holds for $k_0 = 3$ and $m = s$. Suppose that on $\mathcal{T}$ and
for some $1 > \eta > 0$, $\lambda_n$ is chosen as in (4.5) with $K(s, s, k_0) = K(s, s, 3, X)$ and
$c_0 = 1$. Suppose $s$ satisfies (3.7) and

/3min > max I ^'^ , N ^ ^ j 16i^'Ainit^ (4.10) 

where $\lambda_{\text{init}}$ is defined in (3.2) with $B = \sqrt{24}$ and $K = K(s, s, 3, X)$. Then, with
probability $1 - 2/p^2$, the adaptive estimator in (2.2) satisfies
$\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta)$.

It is an open question whether the adaptive Lasso procedure can achieve model selection consistency under 
such sparsity level under Assumption 3.1 alone. 



5 The adaptive Lasso with random design 



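For a random design $X$ as in (2.6), we make the following assumption on $\Sigma$.

Assumption 5.1. Restricted eigenvalue assumption $RE(s, m, k_0, \Sigma)$. For some integer
$1 \leq s \leq p/2$, $m \geq s$, $s + m \leq p$, and a positive number $k_0$, the following condition holds:

$$\frac{1}{K(s, m, k_0, \Sigma)} := \min_{J_0 \subseteq \{1,\ldots,p\},\ |J_0| \leq s}\ \min_{\gamma \neq 0,\ \|\gamma_{J_0^c}\|_1 \leq k_0\|\gamma_{J_0}\|_1} \frac{\|\Sigma^{1/2}\gamma\|_2}{\|\gamma_{J_{0m}}\|_2} > 0. \tag{5.1}$$

Suppose (2.8) holds and $\Sigma_{jj} = 1$, $\forall j = 1, \ldots, p$.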
It is clear that in (5.1), the set of admissible $\gamma$ is a superset of that in (2.7). Hence we have:

Proposition 5.1. Suppose Assumption $RE(s, s, k_0, \Sigma)$ holds for some $1 \leq s \leq p$ and some
$k_0 > 0$. Then $\Lambda_{\min}(s) \geq \frac{16}{17K^2(s, s, k_0, \Sigma)} > 0$ for $\Lambda_{\min}(s)$ as
defined in (2.7).

We now show that with high probability, Assumption $RE(s, m, k_0, X)$ holds for a random realization of
$X$ whose rows are i.i.d. vectors from $N(0, \Sigma)$, under Assumption 5.1, if
$s = o(\sqrt{n/\log p})$.

Proposition 5.2. Consider a random design $X$ as in (2.6). Assume that $\Sigma$ satisfies (5.1). Then, on
the set $\mathcal{X}$ as defined in (2.9) and with $C_2$ as in (2.9), $X$ satisfies $RE(s, s, k_0, X)$ as
in Assumption 3.1, with



K{s,s,ko,X)<V2K{s,s,ko,^), -^o^^^^^c^^ItItS) ^^"^^ 
Its proof appears in Subsection 11.1. 

We can now state the result for a random design under Assumption 5.1. 

Theorem 5.3. (Variable selection for a random design) Consider the linear model in (1.1) with random
design $X$ as in (2.6) with $n < p$ and $p < e^{n/(4C_2^2)}$, where $C_2 > 4\sqrt{5/3}$. Suppose that
Assumption 5.1 holds with $m = s$ and $k_0 = 3$. Suppose that on the set $\mathcal{X} \cap \mathcal{T}$ and
for some $0 < \eta < 1$, $\lambda_n$ is chosen as in (4.5) with
$K(s, s, k_0) = \sqrt{2}K(s, s, 3, \Sigma)$ and $c_0 = \sqrt{3/2}$; suppose that



1 . f 1 ^Arni„(s)(l - 7]) 



' - 32K^s, s, 3, S) I C2 ' ' " j V logp ^^-^^ 

where $C_2$ is defined in (2.9). In addition, $\beta_{\min}$ satisfies (4.9) with
$K = \sqrt{2}K(s, s, 3, \Sigma)$. Then, with probability $1 - 3/p^2$, the adaptive Lasso estimator in
(2.2) satisfies $\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta)$.



A proof is given in Section 11.3. 



6 The adaptive Lasso in Gaussian graphical modeling 

Consider the problem of covariance selection described in Section 1.2. 



6.1 The many regressions pursuit procedure 

The procedure for covariance selection in a Gaussian graphical model based on a pursuit of many regressions
has been proposed and studied in Meinshausen and Bühlmann (2006).

Consider $X_1, \ldots, X_p \sim N(0, \Sigma)$ as in (1.4). We can regress $X_i$ versus the other variables
$\{X_k;\ k \neq i\}$:

$$X_i = \sum_{j \neq i}\beta^i_j X_j + V_i, \tag{6.1}$$

where $V_i$ is a normally distributed random variable with mean zero. Then, denoting by
$\Theta = \Sigma^{-1}$, it is well known that

$$\beta^i_j = -\frac{\Theta_{ij}}{\Theta_{ii}}. \tag{6.2}$$

In particular, this implies that

there is an undirected edge between $i$ and $j$
$\Longleftrightarrow\ \Sigma^{-1}_{ij} \neq 0\ \Longleftrightarrow\ \beta^i_j \neq 0$ and/or $\beta^j_i \neq 0$,

where the last statement holds due to the symmetry of $\Sigma^{-1}$.
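The relation between the regression coefficients and the concentration matrix can be verified numerically at the population level. The sketch below computes the coefficients of $X_i$ on the remaining variables in two ways: from the normal equations $\Sigma_{-i,-i}\,\beta^i = \Sigma_{-i,i}$, and from the ratio of concentration matrix entries. Function names are hypothetical, and the convention of storing a zero at position $i$ is an assumption made for convenience.

```python
import numpy as np

def regression_coefs(sigma, i):
    """Population coefficients of X_i regressed on all other variables."""
    p = sigma.shape[0]
    rest = np.array([k for k in range(p) if k != i])
    coefs = np.linalg.solve(sigma[np.ix_(rest, rest)], sigma[rest, i])
    beta = np.zeros(p)
    beta[rest] = coefs          # position i stays 0 by convention
    return beta

def coefs_from_concentration(sigma, i):
    """The same coefficients computed as -Theta_ij / Theta_ii."""
    theta = np.linalg.inv(sigma)
    beta = -theta[i] / theta[i, i]
    beta[i] = 0.0
    return beta
```

For a covariance matrix whose inverse is tridiagonal, both functions return coefficient vectors that are nonzero only at the neighbors of node $i$, which is exactly the neighborhood that the many regressions pursuit aims to recover.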

The estimation of the edge set can then be done by one of the following rules:

there is an edge between $i$ and $j$ $\Longleftrightarrow$ $\hat\beta^i_j \neq 0$ and $\hat\beta^j_i \neq 0$,
there is an edge between $i$ and $j$ $\Longleftrightarrow$ $\hat\beta^i_j \neq 0$ or $\hat\beta^j_i \neq 0$.

Our obvious proposal is to use the adaptive Lasso estimates in the corresponding regressions as
described in (6.1). The discrepancy between the "and" and the "or" rule above vanishes with high
probability.
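Given the coefficient estimates from all $p$ regressions, both rules reduce to elementwise logic on a boolean matrix. The sketch below assumes the estimates are collected in a $p \times p$ array `B` whose entry `B[i, j]` is the estimated coefficient of $X_j$ in the regression of $X_i$; this layout, the function name and the tolerance `tol` are hypothetical.

```python
import numpy as np

def edge_sets(B, tol=0.0):
    """Return (and_edges, or_edges) as sets of pairs (i, j) with i < j."""
    nz = np.abs(np.asarray(B, dtype=float)) > tol
    np.fill_diagonal(nz, False)                 # a node is never its own neighbor

    def to_set(adj):
        rows, cols = np.nonzero(np.triu(adj, k=1))
        return {(int(i), int(j)) for i, j in zip(rows, cols)}

    return to_set(nz & nz.T), to_set(nz | nz.T)
```

The "and" set is always contained in the "or" set; the two coincide exactly when every selected coefficient is matched by its counterpart in the reverse regression.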

The theoretical analysis follows from our result for random design linear models (Theorem 5.3) and
controlling the error over $p$ different regressions. Let $\beta_{\min} = \min_{i,j}|\beta^i_j|$, where the
minimum is taken over the nonzero coefficients, and let $s$ be the largest node degree. Our conditions on
sparsity and $\beta_{\min}$ for linear models need to hold for all $p$ regressions simultaneously and they
are as follows.

Assumption 6.1. The coefficients $\beta^i_j$ from (6.1) satisfy the conditions on $\beta_{\min}$ as in
(4.9) $\forall i, j \in \{1, \ldots, p\}$ under Assumption 5.1.

Equivalently, by assuming $\Sigma^{-1}_{jj} = 1$ for all $j = 1, \ldots, p$ (see Assumption 5.1) and due to
(6.2), the non-zero elements of $|\Sigma^{-1}_{ij}|$ are required to be upper-bounded by the value of
$\beta_{\min}$.

Assumption 6.2. The covariance matrix $\Sigma$ satisfies the restricted eigenvalue condition in
Assumption 5.1. In addition, (2.8) is required to hold on every subset $S \subseteq \{1, \ldots, p\}$ such
that $|S| \leq s$.

Assumption 6.3. The size of the neighborhood, for all nodes, is bounded by an integer $1 \leq s \leq p/2$
that satisfies (5.3) under Assumption 5.1.

The following result can then be immediately derived using the union bound for the $p$ regressions in the
many regressions pursuit.

Theorem 6.1. (Covariance selection in Gaussian Graphical Models) Consider the Gaussian graphical model
with $n$ i.i.d. samples from (1.4), where $n < p < e^{n/(4C_2^2)}$ and $C_2 > 4\sqrt{5/3}$. Suppose that
Assumptions 6.1 - 6.3 hold. Then,

$$P\left(\mathrm{supp}(\hat\Sigma_n^{-1}) = \mathrm{supp}(\Sigma^{-1})\right) \geq 1 - 3/p.$$
7 Discussion



We have presented results for high-dimensional model selection in regression and Gaussian graphical
modeling. We make some assumptions on (fixed or random) designs in terms of restricted eigenvalues. Such
assumptions are among the weakest for deriving oracle inequalities in terms of $\|\hat\beta - \beta\|_q$
($q = 1, 2$) (Bickel et al., 2008). We show here that under such restricted eigenvalue assumptions, the
two-stage adaptive Lasso is able to correctly infer the relevant variables in regression or the edge set in
a Gaussian graphical model. The ordinary Lasso can easily fail since the neighborhood stability condition,
or the equivalent irrepresentable condition, is necessary and sufficient (Meinshausen and Bühlmann, 2006;
Zhao and Yu, 2006). It is easy to construct examples where the neighborhood stability condition fails but
the restricted eigenvalue condition holds for the situation where $n > p$, see for example Zou (2006).

In the high-dimensional context, the relation between the neighborhood stability condition and the restricted
eigenvalue assumption is not clear. However, the latter is a condition on an average behavior (as an
eigenvalue condition) while the former requires a relation for a maximum: thus, we conjecture that the restricted
eigenvalue assumption is in general less restrictive than the neighborhood stability condition. In
particular, although it appears non-trivial to derive a general relation between these two conditions, one can
certainly derive relations between them under additional assumptions. A thorough exposition of such
relations is an interesting direction for future work, given the frequent appearance of both types of conditions in
the literature, for example in Meinshausen and Bühlmann (2006); Zhao and Yu (2006); Wainwright (2008);
Candes and Tao (2007); Meinshausen and Yu (2009); Bickel et al. (2008). For high-dimensional Gaussian
graphical modeling, using the reasoning above, the restricted eigenvalue assumptions we make appear in
general less restrictive (and easier to check) than the assumptions in Meinshausen and Bühlmann (2006)
and in Ravikumar et al. (2008), who analyze the GLasso algorithm (Banerjee et al., 2008; Friedman et al.,
2008b).

8 Analysis of the weighted Lasso 

In the sequel, for clarity, we denote by $\beta^*$ the true parameter in the linear model (1.1). Inspired by the
adaptive Lasso estimator defined in (2.2), we consider here the weighted Lasso with weights $0 < w_j$
($j = 1, \dots, p$) which solves the following optimization problem:

$$\hat\beta = \arg\min_{\beta} \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda_n \sum_{j=1}^{p} w_j |\beta_j|. \qquad (8.1)$$

The only distinction between the adaptive and the weighted Lasso is that the weights are estimated in the
former and pre-specified in the latter approach. However, our theory below does not
depend on whether the weights are random or not. For convenience we denote by

$$w_{\max}(S) = \max_{i \in S} w_i, \qquad w_{\min}(S^c) = \min_{j \in S^c} w_j. \qquad (8.2)$$



A slightly stronger notion than inferring the support of $\beta^*$ is the recovery of the sign-pattern:

$$\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^*).$$

Furthermore, there are in general multiple solutions for the adaptive Lasso estimator in (2.2) and for the
weighted Lasso in (8.1). However, with high probability, the solution is unique, see also Remark 2.2 and
Section 12.2.

As before, we denote by $\|A\|_\infty = \max_{1 \le j \le k} \sum_{i=1}^{m} |A_{ji}|$ for a $k \times m$ matrix $A$. First, let us state the
following conditions that are imposed on the design matrix for the ordinary Lasso by Zhao and Yu (2006)
and Wainwright (2008):

$$\left\|X_{S^c}^T X_S (X_S^T X_S)^{-1}\right\|_\infty \le 1 - \eta, \quad \text{for some } \eta \in (0, 1], \text{ and} \qquad (8.3a)$$
$$\Lambda_{\min}\left(\tfrac{1}{n} X_S^T X_S\right) \ge \Lambda_{\min}(s) > 0, \qquad (8.3b)$$

where $\Lambda_{\min}(A)$ is the smallest eigenvalue of $A$. Note that the second condition coincides with ours in (2.4).
Meinshausen and Bühlmann (2006) formulated such conditions for a random design.

We impose the following incoherence conditions on the weighted Lasso. 

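Definition 8.1. ((w, S)-incoherence condition) Let X be an $n \times p$ matrix and let $S \subset \{1, \dots, p\}$ be
nonempty. Let $w = (w_1, w_2, \dots, w_p)^T$ be a weight vector, where $w_j > 0$, $\forall j$. Let $b = (\mathrm{sgn}(\beta^*_i) w_i)_{i \in S}$. We
say that X is (w, S)-incoherent if for some $\eta \in (0, 1)$,

$$\forall j \in S^c, \quad \left|X_j^T X_S (X_S^T X_S)^{-1} b\right| \le w_j (1 - \eta), \qquad (8.4a)$$
$$\Lambda_{\min}\left(\tfrac{1}{n} X_S^T X_S\right) \ge \Lambda_{\min}(s) > 0, \qquad (8.4b)$$

where a sufficient condition for (8.4a) is

$$\forall j \in S^c, \quad \left\|X_j^T X_S (X_S^T X_S)^{-1}\right\|_1 \le \frac{w_{\min}(S^c)}{w_{\max}(S)} (1 - \eta). \qquad (8.5)$$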
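
To make the relation between (8.4a) and (8.5) concrete, the following sketch evaluates both left-hand sides for a given design, rescaled so that each condition reads "value <= 1 - eta". The function name and the simulated design are assumptions for illustration only; the check relies on $|X_j^T X_S (X_S^T X_S)^{-1} b| \le \|X_j^T X_S (X_S^T X_S)^{-1}\|_1 \, w_{\max}(S)$ and $w_j \ge w_{\min}(S^c)$.

```python
import numpy as np

def incoherence_lhs(X, S, w, sign_beta_S):
    """Left-hand sides of (8.4a) and (8.5), each to be compared with 1 - eta."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)
    XS, XSc = X[:, S], X[:, Sc]
    # Row j of M is X_j^T X_S (X_S^T X_S)^{-1} for j in S^c.
    M = np.linalg.solve(XS.T @ XS, XS.T @ XSc).T
    b = np.asarray(sign_beta_S, dtype=float) * w[S]
    lhs_84a = np.abs(M @ b) / w[Sc]                              # (8.4a) divided by w_j
    lhs_85 = np.abs(M).sum(axis=1) * w[S].max() / w[Sc].min()    # (8.5) rescaled
    return lhs_84a, lhs_85

# Simulated Gaussian design with arbitrary positive weights (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
w = rng.uniform(0.5, 2.0, size=6)
lhs_84a, lhs_85 = incoherence_lhs(X, S=[0, 1], w=w, sign_beta_S=[1.0, -1.0])
```

Since `lhs_84a <= lhs_85` entrywise, any $\eta$ that satisfies the rescaled (8.5) also satisfies (8.4a).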

We now state a general lemma about recovering the signs for the weighted Lasso estimator as defined 
in (2.2). 

Lemma 8.2. (Sign recovery Lemma) Consider the linear model in (1.1) where the design matrix X
satisfies (8.4a) and (8.4b). Let $c_0 = \max_{j \in S^c} \|X_j\|_2/\sqrt{n}$. Suppose that $w_j > 0$, $\forall j = 1, \dots, p$, and $\lambda_n$ is chosen
such that

$$\lambda_n w_{\min}(S^c) \ge \frac{4 c_0 \sigma}{\eta} \sqrt{\frac{2\log(p - s)}{n}},$$

where $w_{\min}(S^c)$, $w_{\max}(S)$ are as defined in (8.2). Furthermore, assume



/3^in > maxr^-^y— — , j^-j- (8.6) 



Then for $\hat\beta$ in (8.1):

$$P\left(\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^*)\right) \ge 1 - 2/p^2.$$

Moreover, with T defined in (2.5), we have $P\left(\{\mathrm{sgn}(\hat\beta) \ne \mathrm{sgn}(\beta^*)\} \cap T\right) \le 2/p^2$.

A proof is given in Section 12.3. Note that in case $w_{\min}(S^c) = w_{\max}(S) = 1$, conditions (8.5) and (8.6)
reduce to (8.3a) and the statement of Lemma 8.2 is exactly the same as Theorem 1 in Wainwright (2008).

9 Proof of Lemma 2.5



Lemma 9.1. For fixed design X with $\max_j \|X_j\|_2 \le c_0\sqrt{n}$ we have for T as defined in (2.5),



(9.1) 



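Proof. Define the random variables

$$Y_j = \frac{1}{n}\sum_{i=1}^{n} \epsilon_i X_{i,j}, \qquad j = 1, \dots, p.$$

Note that $\max_{1 \le j \le p} |Y_j| = \|X^T\epsilon/n\|_\infty$. We have $E(Y_j) = 0$ and $\mathrm{Var}(Y_j) = \|X_j\|_2^2\sigma^2/n^2 \le c_0^2\sigma^2/n$. Obviously,
$Y_j$ has its tail probability dominated by that of $Z \sim N(0, c_0^2\sigma^2/n)$: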



We now show that $P(\mathcal{X}) \ge 1 - 1/p^2$.

We denote $\Sigma_{jj} := \sigma_j^2$ throughout the rest of this proof. We first state the following large deviation bound
for the non-diagonal entries of $\hat\Sigma$, adapted from Lemma 38 (Zhou et al., 2008) by plugging in $\sigma_j^2 = 1$,
$j = 1, \dots, p$, and using the fact that $|\Sigma_{jk}| = |\rho_{jk}\sigma_j\sigma_k| \le 1$, $\forall j \ne k$, where $\rho_{jk}$ is the correlation coefficient
between variables $X_j$ and $X_k$.

Lemma 9.2- (Zhou et al., 2008) Let = (1 + /2- ForO<T < "^jj,, 




We can now apply the union bound to obtain: 




By choosing $t = c_0\sigma\sqrt{6\log(p)/n}$, the right-hand side is bounded by $1/p^2$. □
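
The arithmetic behind the choice $t = c_0\sigma\sqrt{6\log(p)/n}$ can be checked numerically. The sketch below is an illustration under stated assumptions: it takes each $Y_j$ to be dominated by $Z \sim N(0, c_0^2\sigma^2/n)$, applies the Gaussian tail bound $P(|Z| \ge t) \le (\mathrm{sd}/t)\exp(-t^2/(2\,\mathrm{sd}^2))$ (the form used in Section 12.3), and multiplies by p for the union bound. The function name and default parameter values are hypothetical.

```python
import math

def lemma91_union_bound(p, n, c0=1.0, sigma=1.0):
    """Union bound on P(max_j |Y_j| >= t) for t = c0*sigma*sqrt(6 log p / n)."""
    sd = c0 * sigma / math.sqrt(n)                    # dominating standard deviation
    t = c0 * sigma * math.sqrt(6.0 * math.log(p) / n)
    exact = p * math.erfc(t / (sd * math.sqrt(2.0)))  # p * P(|Z| >= t), exact Gaussian tail
    tail_bound = p * (sd / t) * math.exp(-t ** 2 / (2.0 * sd ** 2))
    return exact, tail_bound, 1.0 / p ** 2

checks = [lemma91_union_bound(p, n=100) for p in (2, 10, 1000)]
```

With this choice of t the exponential factor equals $p^{-3}$, so the union bound is $p^{-2}/\sqrt{6\log p}$, which is at most $1/p^2$ for $p \ge 2$.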




(9.2) 



We now also state a large deviation bound for the $\chi^2_n$ distribution (Johnstone, 2001):




(9.3) 



Hence by the union bound, we have for $j = 1, \dots, p$, and for $\tau < 1/2$,




(9.4) 






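Lemma 9.3. For a random design X as in (2.6) with $\Sigma_{jj} = 1$, $\forall j \in \{1, \dots, p\}$, and for $p < e^{n/(4C_2^2)}$, where
$C_2 > 4\sqrt{5/3}$, we have

$$P(\mathcal{X}) \ge 1 - 1/p^2.$$

Proof. Now it is clear that we have $p(p - 1)/2$ unique non-diagonal entries $\sigma_{jk}$, $\forall j \ne k$, and p diagonal
entries. By the union bound and by taking $\tau = C_2\sqrt{\log p / n}$ in (9.4) and (9.2), we have

$$P(\mathcal{X}^c) = P\left(\max_{j,k} |\Delta_{jk}| \ge C_2\sqrt{\frac{\log p}{n}}\right)
\le p \exp\left(-\frac{3C_2^2\log p}{16}\right) + \frac{p^2 - p}{2} \exp\left(-\frac{3C_2^2\log p}{20}\right)$$
$$\le p^2 \exp\left(-\frac{3C_2^2\log p}{20}\right) = p^{-\frac{3C_2^2}{20} + 2} < \frac{1}{p^2}$$

for $C_2 > 4\sqrt{5/3}$. Finally, $p < e^{n/(4C_2^2)}$ guarantees that $C_2\sqrt{\log p / n} < 1/2$. □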
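
The role of the constant $4\sqrt{5/3}$ can be seen by evaluating the two terms of the union bound directly. The following sketch is only a numerical illustration of the displayed chain of inequalities; the function name and the particular values of p are arbitrary choices.

```python
import math

def lemma93_union_bound(p, C2):
    """Evaluate the union bound over p diagonal and p(p-1)/2 off-diagonal entries."""
    logp = math.log(p)
    diag = p * math.exp(-3.0 * C2 ** 2 * logp / 16.0)
    offdiag = (p ** 2 - p) / 2.0 * math.exp(-3.0 * C2 ** 2 * logp / 20.0)
    crude = p ** (2.0 - 3.0 * C2 ** 2 / 20.0)   # p^2 * exp(-3 C2^2 log p / 20)
    return diag + offdiag, crude

C2 = 4.0 * math.sqrt(5.0 / 3.0) + 0.01          # slightly above the threshold
results = {p: lemma93_union_bound(p, C2) for p in (2, 50, 10 ** 4)}
```

At the threshold $C_2 = 4\sqrt{5/3}$ one has $3C_2^2/20 = 4$, so the crude bound equals exactly $p^{-2}$; any larger $C_2$ makes it strictly smaller.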



10 Proofs for Section 4 



Throughout this section, we have $\lambda_{\mathrm{init}} = B c_0 \sigma \sqrt{\log p / n}$ with $B = \sqrt{24}$.



10.1 The Lasso as initial estimator 

Lemma 4.1 crucially uses the bound on the $\ell_1$-loss of the initial Lasso estimator.

Our proof follows that of Bickel et al. (2008). Let $\beta_{\mathrm{init}}$ be as in (3.1) and $\delta = \beta_{\mathrm{init}} - \beta^*$. The set T is defined
in (2.5). We first show Lemma 10.1; we then apply condition $RE(s, k_0, X)$ on $\delta$ with $k_0 = 3$ under T to
derive various norm bounds.



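Lemma 10.1. For fixed design, on T, $\|\delta_{S^c}\|_1 \le 3\|\delta_S\|_1$.

Proof. Since $\beta_{\mathrm{init}}$ is a Lasso solution, we have

$$\lambda_{\mathrm{init}}\|\beta^*\|_1 - \lambda_{\mathrm{init}}\|\beta_{\mathrm{init}}\|_1 \ge \frac{1}{2n}\|Y - X\beta_{\mathrm{init}}\|_2^2 - \frac{1}{2n}\|Y - X\beta^*\|_2^2 = \frac{1}{2n}\|X\delta\|_2^2 - \frac{\delta^T X^T \epsilon}{n}.$$

Hence on the set T as in (2.5), we have

$$\frac{\|X\delta\|_2^2}{n} \le 2\lambda_{\mathrm{init}}\|\beta^*\|_1 - 2\lambda_{\mathrm{init}}\|\beta_{\mathrm{init}}\|_1 + 2\left\|\frac{X^T\epsilon}{n}\right\|_\infty \|\delta\|_1
\le \lambda_{\mathrm{init}}\left(2\|\beta^*\|_1 - 2\|\beta_{\mathrm{init}}\|_1 + \|\delta\|_1\right), \qquad (10.1)$$

where by the triangle inequality, and $\beta^*_{S^c} = 0$, we have

$$0 \le 2\|\beta^*\|_1 - 2\|\beta_{\mathrm{init}}\|_1 + \|\delta\|_1$$
$$= 2\|\beta^*_S\|_1 - 2\|\beta_{S,\mathrm{init}}\|_1 - 2\|\delta_{S^c}\|_1 + \|\delta_S\|_1 + \|\delta_{S^c}\|_1$$
$$\le 3\|\delta_S\|_1 - \|\delta_{S^c}\|_1. \qquad (10.2)$$

Thus Lemma 10.1 holds. □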
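
The last two lines of (10.2) are a purely algebraic consequence of the triangle inequality and $\beta^*_{S^c} = 0$, so they hold for any candidate vector, not only for a Lasso solution. The sketch below checks this step on random vectors; the function name and the simulated vectors are assumptions for illustration, and the code does not compute a Lasso solution or verify the lower bound $0 \le \dots$, which requires the basic inequality on T.

```python
import numpy as np

def cone_step(beta_star, beta_init, S):
    """Return both sides of the upper bound in (10.2) for delta = beta_init - beta_star."""
    beta_star = np.asarray(beta_star, dtype=float)
    beta_init = np.asarray(beta_init, dtype=float)
    on_S = np.zeros(beta_star.size, dtype=bool)
    on_S[S] = True
    delta = beta_init - beta_star
    lhs = 2 * np.abs(beta_star).sum() - 2 * np.abs(beta_init).sum() + np.abs(delta).sum()
    rhs = 3 * np.abs(delta[on_S]).sum() - np.abs(delta[~on_S]).sum()
    return lhs, rhs

rng = np.random.default_rng(1)
S = [0, 3, 4]
beta_star = np.zeros(10)
beta_star[S] = [2.0, -1.5, 0.7]          # sparse truth supported on S
results = [cone_step(beta_star, rng.standard_normal(10), S) for _ in range(100)]
```

Whenever the left-hand side is non-negative, as it is for the Lasso solution on T, the inequality forces $\|\delta_{S^c}\|_1 \le 3\|\delta_S\|_1$.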

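Proposition 10.2. ($\ell_p$-loss for the initial estimator, (Bickel et al., 2008)) Consider the linear model in
(1.1) with fixed design satisfying $\max_j \|X_j\|_2 \le c_0\sqrt{n}$. Suppose that $RE(s, 3, X)$ holds. Let $\delta = \beta_{\mathrm{init}} - \beta^*$
with $\beta_{\mathrm{init}}$ defined in (3.1) with

$$\lambda_{\mathrm{init}} = B c_0 \sigma \sqrt{\frac{\log p}{n}}.$$

Then, on the set T in (2.5),

$$\|\delta_S\|_2 \le 4K^2(s, 3, X)\lambda_{\mathrm{init}}\sqrt{s}, \qquad (10.3)$$
$$\|\delta\|_1 \le 4K^2(s, 3, X)\lambda_{\mathrm{init}}\, s. \qquad (10.4)$$

Moreover, under the stronger assumption $RE(s, s, 3, X)$, and on the set T as in (2.5),

$$\|\delta\|_2 \le 16K^2(s, s, 3, X)\lambda_{\mathrm{init}}\sqrt{s}. \qquad (10.5)$$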

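Proof. On the set T, by (10.1) and (10.2),

$$\frac{\|X\delta\|_2^2}{n} + \lambda_{\mathrm{init}}\|\delta\|_1 \le \lambda_{\mathrm{init}}\left(3\|\delta_S\|_1 - \|\delta_{S^c}\|_1 + \|\delta_S\|_1 + \|\delta_{S^c}\|_1\right)$$
$$= 4\lambda_{\mathrm{init}}\|\delta_S\|_1 \le 4\lambda_{\mathrm{init}}\sqrt{s}\,\|\delta_S\|_2 \qquad (10.6)$$
$$\le 4\lambda_{\mathrm{init}}\sqrt{s}\,K(s, 3, X)\frac{\|X\delta\|_2}{\sqrt{n}} \qquad (10.7)$$
$$\le 4K^2(s, 3, X)\lambda_{\mathrm{init}}^2\, s + \frac{\|X\delta\|_2^2}{n},$$

where (10.7) is due to condition $RE(s, 3, X)$ and Lemma 10.1. Hence (10.4) holds. Now by $RE(s, 3, X)$
and (10.6), we have

$$\|\delta_S\|_2^2 \le K^2(s, 3, X)\frac{\|X\delta\|_2^2}{n} \le K^2(s, 3, X)\,4\lambda_{\mathrm{init}}\sqrt{s}\,\|\delta_S\|_2. \qquad (10.8)$$

Hence (10.3) holds. Finally, on the set T, given Lemma 10.1, by $RE(s, s, 3, X)$ and (10.6), we have

$$\|\delta_{SS'}\|_2^2 \le K^2(s, s, 3, X)\frac{\|X\delta\|_2^2}{n}
\le K^2(s, s, 3, X)\,4\lambda_{\mathrm{init}}\sqrt{s}\,\|\delta_S\|_2
\le K^2(s, s, 3, X)\,4\lambda_{\mathrm{init}}\sqrt{s}\,\|\delta_{SS'}\|_2.$$

Hence from the following inequality (10.9) (e.g., cf. (B.28) in Bickel et al. (2008))

$$\|\delta\|_2 \le (1 + k_0)\|\delta_{SS'}\|_2, \qquad (10.9)$$

we obtain (10.5). □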
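
10.2 Proof of Lemma 4.1

By Proposition 10.2, and (B.26) in Bickel et al. (2008),

$$\|\delta_S\|_2 \le 4K(s, 3, X)^2\lambda_{\mathrm{init}}\sqrt{s},$$
$$\|\delta\|_1 \le 4K(s, 3, X)^2\lambda_{\mathrm{init}}\, s, \quad \text{where}$$
$$\|\delta_{S^c}\|_1 \le 3\|\delta_S\|_1,$$

due to a property of the Lasso estimator (see, for example, Bickel et al. (2008)). This allows us to conclude
that on the set T as in (2.5),

$$\|\delta_S\|_\infty \le \|\delta_S\|_2 \le 4K(s, 3, X)^2\lambda_{\mathrm{init}}\sqrt{s}, \qquad (10.10)$$
$$\|\delta_{S^c}\|_1 \le \tfrac{3}{4}\|\delta\|_1 \le 3K(s, 3, X)^2\lambda_{\mathrm{init}}\, s. \qquad (10.11)$$

Thus we have by (4.1), (10.10) and (10.11),

$$\forall i \in S, \quad |\beta_{i,\mathrm{init}}| \ge \beta_{\min} - \|\delta_S\|_\infty \ge 4K(s, 3, X)^2\lambda_{\mathrm{init}}\sqrt{s}, \qquad (10.12)$$
$$\forall j \in S^c, \quad |\beta_{j,\mathrm{init}}| \le \|\delta_{S^c}\|_\infty \le \|\delta_{S^c}\|_1 \le 3K(s, 3, X)^2\lambda_{\mathrm{init}}\, s. \qquad (10.13)$$

□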



10.3 Proof of Lemma 4.2 

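If we threshold $\beta_{\mathrm{init}}$ at the value of $4\lambda_{\mathrm{init}}$, by (10.12), we have $S \subseteq \hat{S}$. Moreover, by (10.11), we include at
most $3K(s, 3, X)^2 s/4$ more entries from $S^c$ in $\hat{S}$; thus for $K(s, 3, X) \ge 2$,

$$s \le |\hat{S}| \le s + \frac{3K(s, 3, X)^2 s}{4} \le sK(s, 3, X)^2.$$

In addition, we have $\forall j \in S^c$, by (10.5),

$$|\beta_{j,\mathrm{init}}| \le \|\delta_{S^c}\|_\infty \le \|\delta\|_2 \le 16K^2(s, s, 3, X)\lambda_{\mathrm{init}}\sqrt{s},$$

under Assumption $RE(s, s, 3, X)$ and condition T. □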
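
The thresholding step described in this proof is simple to state in code. The sketch below is a minimal illustration, assuming an initial estimate `beta_init` is already available; the function name, the example numbers, and the convention of keeping coefficients whose magnitude is at least the threshold (rather than strictly above it) are assumptions made here, not details taken from the paper.

```python
import numpy as np

def threshold_support(beta_init, lam_init, factor=4.0):
    """Indices whose initial estimate has magnitude >= factor * lam_init."""
    beta_init = np.asarray(beta_init, dtype=float)
    return np.flatnonzero(np.abs(beta_init) >= factor * lam_init)

# Made-up initial estimate; with lam_init = 0.1 the threshold is 0.4.
S_hat = threshold_support([1.0, 0.05, -0.9, 0.0, 0.3], lam_init=0.1)
```

The counting argument in the proof follows from (10.11): each falsely selected index contributes at least $4\lambda_{\mathrm{init}}$ to $\|\delta_{S^c}\|_1$, so at most $3K(s,3,X)^2 s/4$ such indices can be present.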



10.4 Proof of Theorem 4.3 

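It is clear that once we have checked that the conditions on $\lambda_n$ in (2.12), on s in (2.13) and on $\beta_{\min}$ in (2.14)
hold, we can invoke Theorem 2.1 to finish the proof. Formula (4.1) is satisfied assuming (4.9). Hence by
choosing

$$\delta_S := 4K^2(s, 3, X)\lambda_{\mathrm{init}}\sqrt{s}, \qquad (10.14)$$
$$\delta_{S^c} := 16K^2(s, s, 3, X)\lambda_{\mathrm{init}}\sqrt{s}, \qquad (10.15)$$

we have $\delta_S \ge \|\delta_S\|_\infty$ and $\delta_{S^c} \ge \|\delta_{S^c}\|_\infty$ by (4.2a) and (4.3). Now by (4.4),

$$\lambda_n \ge \frac{64\sigma K^2(s, s, 3, X)\lambda_{\mathrm{init}}\sqrt{s}}{\eta}\sqrt{\frac{2\log(p - s)}{n}}
= \frac{4\sigma}{\eta}\left(16K^2(s, s, 3, X)\lambda_{\mathrm{init}}\sqrt{s}\right)\sqrt{\frac{2\log(p - s)}{n}}
= \frac{4\sigma\delta_{S^c}}{\eta}\sqrt{\frac{2\log(p - s)}{n}}$$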

and 



16Mi^'2(s,3,3,X)cjAinitY'|5'| /21og(p- s) 

An < 



ii:(s,3,X) V n 



V n 



and thus (2.12) holds with $c_0 = 1$. Furthermore, for the sparsity s, (4.6) guarantees that (2.13) holds
by (10.15). Finally, regarding /3min, (2.14) holds given (4.7), as i6AfJ^,ni,V5 clearly dominates the first and 

the third term in (2. 14) by the definition of (10. 14) and (10. 15), and the fact that . ^ . . < K'^is, ko,X) by 
Proposition 3.2; and it also dominates the second term given (3.7) and the upper bound on A^- □ 



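10.5 Bounds for $r_n$

Lemma 10.3. Consider a fixed design X with $\max_j \|X_j\|_2 \le c_0\sqrt{n}$ and assume that (2.4) holds. Then
for all subsets S with $|S| \le s$,

$$\left\|X_{S^c}^T X_S (X_S^T X_S)^{-1}\right\|_\infty \le \frac{c_0\sqrt{s}}{\sqrt{\Lambda_{\min}(s)}} \qquad (10.16)$$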



< T^^, (10.17) 

Amin(s) 



where ^1 ^ is given in 3.5 



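Proof. As a shorthand, we let $P_S = X_S(X_S^T X_S)^{-1}X_S^T$ denote the projection matrix and define

$$\forall j \in S^c, \quad r_j = (X_S^T X_S)^{-1}X_S^T X_j.$$

Bounding $\|r_j\|_1$, $\forall j$, yields a bound on $r_n$. First we have for all $j \in S^c$,

$$\|X_S r_j\|_2 = \|X_S(X_S^T X_S)^{-1}X_S^T X_j\|_2 = \|P_S X_j\|_2 \qquad (10.18)$$
$$\le \|X_j\|_2 \le c_0\sqrt{n}.$$

On the other hand, by the restricted eigenvalue assumption, we have

$$\|X_S r_j\|_2^2 = r_j^T X_S^T X_S r_j \ge n\Lambda_{\min}(s)\|r_j\|_2^2.$$

Thus we have that $\|r_j\|_2 \le c_0/\sqrt{\Lambda_{\min}(s)}$, $\forall j \in S^c$, and hence

$$r_n = \max_{j \in S^c}\|r_j\|_1 \le \max_{j \in S^c}\sqrt{s}\,\|r_j\|_2 = \sqrt{s}\max_{j \in S^c}\|r_j\|_2 \le \frac{c_0\sqrt{s}}{\sqrt{\Lambda_{\min}(s)}}.$$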



Next we note that using (3.5), we can bound r„ as follows, which has essentially been shown in Candes and Tao 
(2007). For PsXj = Xsrj, with 



\\Xsr 



JII2 



J\\2 



we have 



\\PsX,\ 



{PsXj,Xj) _ {Xsr^Xj) 



n 



n 



n 



\\Xsr, 



3 Il2 



3 Il2 



Hence, 



\\PsXj\L < 



and r„ < 



Amm(s) ' 



11 Proofs for Section 5 



□ 



11.1 Proof of Proposition 5.2 



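We first bound $\left|\|X\gamma\|_2^2/n - \gamma^T\Sigma\gamma\right|$. With $\hat\Sigma = X^T X/n$,

$$\left|\gamma^T\hat\Sigma\gamma - \gamma^T\Sigma\gamma\right| = \Big|\sum_{j=1}^{p}\sum_{k=1}^{p}\gamma_j\gamma_k\Delta_{jk}\Big|
\le \Big|\sum_{j \in S^c}\sum_{k \in S^c}\gamma_j\gamma_k\Delta_{jk}\Big| + 2\Big|\sum_{j \in S}\sum_{k \in S^c}\gamma_j\gamma_k\Delta_{jk}\Big| + \Big|\sum_{j \in S}\sum_{k \in S}\gamma_j\gamma_k\Delta_{jk}\Big|$$
$$\le \max_{j,k}|\Delta_{jk}|\left(\|\gamma_S\|_1^2 + 2\|\gamma_S\|_1\|\gamma_{S^c}\|_1 + \|\gamma_{S^c}\|_1^2\right),$$

where $\Delta = \hat\Sigma - \Sigma$. Now given that $\|\gamma_{S^c}\|_1 \le k_0\|\gamma_S\|_1$ and $\|\gamma_S\|_1^2 \le s\|\gamma_S\|_2^2$, we have

$$\left|\frac{\|X\gamma\|_2^2}{n} - \gamma^T\Sigma\gamma\right| \le \max_{j,k}|\Delta_{jk}|\,\|\gamma_S\|_1^2\,(1 + 2k_0 + k_0^2)
= \max_{j,k}|\Delta_{jk}|\,\|\gamma_S\|_1^2\,(1 + k_0)^2 \le s(1 + k_0)^2\max_{j,k}|\Delta_{jk}|\,\|\gamma_S\|_2^2.$$

Let $\gamma_{SS'} = \gamma_S \cup \gamma_{S'}$, where $S'$ denotes the subset of $\{1, \dots, p\}$ corresponding to the s largest coordinates
of $\gamma$ in absolute value in $S^c$. We have on $\mathcal{X}$, using Assumption 5.1,

$$\frac{\|X\gamma\|_2^2}{n} \ge \gamma^T\Sigma\gamma - s(1 + k_0)^2\max_{j,k}|\Delta_{jk}|\,\|\gamma_S\|_2^2
\ge \frac{\|\gamma_{SS'}\|_2^2}{K(s, s, k_0, \Sigma)^2} - s(1 + k_0)^2\max_{j,k}|\Delta_{jk}|\,\|\gamma_S\|_2^2
\ge \frac{\|\gamma_{SS'}\|_2^2}{2K(s, s, k_0, \Sigma)^2},$$

and hence (5.2) holds.

□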
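
The key deterministic step in this proof is that, for any $\gamma$ in the cone $\|\gamma_{S^c}\|_1 \le k_0\|\gamma_S\|_1$, the quadratic-form gap $|\gamma^T\Delta\gamma|$ is at most $s(1 + k_0)^2\max_{j,k}|\Delta_{jk}|\,\|\gamma_S\|_2^2$. The sketch below checks this on a simulated example; the function name, the random symmetric matrix standing in for $\Delta$, and the way $\gamma$ is rescaled into the cone are illustrative assumptions.

```python
import numpy as np

def quad_form_gap(Delta, gamma, S, k0):
    """Return (gamma in cone?, |gamma^T Delta gamma|, s(1+k0)^2 max|Delta| ||gamma_S||_2^2)."""
    gamma = np.asarray(gamma, dtype=float)
    on_S = np.zeros(gamma.size, dtype=bool)
    on_S[S] = True
    in_cone = np.abs(gamma[~on_S]).sum() <= k0 * np.abs(gamma[on_S]).sum()
    gap = abs(gamma @ Delta @ gamma)
    bound = on_S.sum() * (1 + k0) ** 2 * np.abs(Delta).max() * np.sum(gamma[on_S] ** 2)
    return in_cone, gap, bound

rng = np.random.default_rng(2)
p, S, k0 = 8, [0, 1, 2], 3.0
A = rng.standard_normal((p, p))
Delta = (A + A.T) / 2                      # symmetric stand-in for Sigma_hat - Sigma
gamma = rng.standard_normal(p)
off = np.ones(p, dtype=bool)
off[S] = False
# Rescale the off-support part so that gamma lies inside the cone.
gamma[off] *= 0.9 * k0 * np.abs(gamma[S]).sum() / np.abs(gamma[off]).sum()
in_cone, gap, bound = quad_form_gap(Delta, gamma, S, k0)
```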



11.2 Eigenvalue bounds 

We now show that (2.4) is satisfied with high probability for a random design X, given its population
counterpart as in (2.8).



Lemma 11.1- Let X hea random design as in (2.6). Lets < ^^j^^f^forC2 as defined in (2.9). We 
have on ttie set X, 



$$\Lambda_{\min}\left(\frac{X_S^T X_S}{n}\right) \ge \Lambda_{\min}(s), \qquad (11.1)$$

for all subsets $S \subset \{1, \dots, p\}$ with $|S| \le s$ where (2.8) holds.



Proof. On the set X, for all subsets S with |5| < s, 

Amin ( — ] - Amm(S5s) 



n 



< 



< 



X^Xs 
n 



^ss 



-'SS 



n 
(11.2)
(11.3) 
(11.4) 



where $\|\cdot\|_2$ denotes here the operator norm of a matrix. (11.2) is a standard result in matrix perturbation
theory, (11.3) is due to the fact that $\hat\Sigma$ and $\Sigma$ are symmetric, and (11.4) is due to (2.9) and the bound on
s. Hence for all subsets S with \S\ < s that satisfy AminlSss) > Y^Amin(s) (as defined in (2.7)), (11.1) 
holds. □ 
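
The perturbation steps described above can be checked numerically. The sketch below assumes the standard facts they refer to: Weyl's inequality $|\Lambda_{\min}(A) - \Lambda_{\min}(B)| \le \|A - B\|_2$, the bound $\|D\|_2 \le \|D\|_\infty$ for a symmetric matrix D, and $\|D\|_\infty \le s\max_{j,k}|D_{jk}|$ for an $s \times s$ matrix. The function name and the simulated sample covariance are illustrative choices, not taken from the paper.

```python
import numpy as np

def eigen_perturbation_chain(A, B):
    """Return |lmin(A) - lmin(B)|, ||A-B||_2, ||A-B||_inf and s * max|A-B| for symmetric A, B."""
    D = A - B
    gap = abs(np.linalg.eigvalsh(A)[0] - np.linalg.eigvalsh(B)[0])  # eigvalsh sorts ascending
    op_norm = np.linalg.norm(D, 2)                                  # spectral norm
    inf_norm = np.abs(D).sum(axis=1).max()                          # maximum absolute row sum
    entry_bound = D.shape[0] * np.abs(D).max()
    return gap, op_norm, inf_norm, entry_bound

rng = np.random.default_rng(3)
G = rng.standard_normal((200, 5))
Sigma_hat_SS = G.T @ G / 200        # sample covariance on a subset of size s = 5
Sigma_SS = np.eye(5)                # its population counterpart in this simulation
gap, op_norm, inf_norm, entry_bound = eigen_perturbation_chain(Sigma_hat_SS, Sigma_SS)
```

On the event $\mathcal{X}$ the last quantity is at most $sC_2\sqrt{\log p/n}$, which is where the bound on s enters.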

11.3 Proof of Theorem 5.3



As a corollary of Lemmas 10.3 and 11.1, we have

Corollary 11.2. Consider a random design X. Then on the set $\mathcal{X}$ defined in (2.9), (10.16) holds with
$c_0 = \sqrt{3/2}$, for all subsets S with $|S| \le s$.

It is clear that (4.1) is always satisfied given (4.9), where $K = \sqrt{2}K(s, s, 3, \Sigma)$, as $K(s, s, k_0, X) \le
\sqrt{2}K(s, s, k_0, \Sigma)$ by Proposition 5.2. We now show that the conditions on $\lambda_n$, s and $\beta_{\min}$ as required by
Theorem 2.1 are satisfied on $\mathcal{X} \cap T$. First we take

$$\delta_S := 8K^2(s, s, 3, \Sigma)\lambda_{\mathrm{init}}\sqrt{s}, \qquad (11.5)$$
$$\delta_{S^c} := 32K^2(s, s, 3, \Sigma)\lambda_{\mathrm{init}}\sqrt{s}, \qquad (11.6)$$
$$r_n \le \frac{\sqrt{3s}}{\sqrt{2\Lambda_{\min}(s)}}, \qquad (11.7)$$

where (11.7) holds by Corollary 11.2, for which



^ ^ 1 ^ Amin(s) I~1T 



32C2i^2(s 5 3 s) - I6C2 y logp' 
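by Proposition 5.1. It is clear that

$$\delta_S \ge 4K^2(s, s, 3, X)\lambda_{\mathrm{init}}\sqrt{s} \ge \|\delta_S\|_\infty, \quad \text{and} \qquad (11.8)$$
$$\delta_{S^c} \ge 16K^2(s, s, 3, X)\lambda_{\mathrm{init}}\sqrt{s} \ge \|\delta_{S^c}\|_\infty \qquad (11.9)$$

given (4.2a) and (4.2b), and Proposition 5.2. Regarding the condition on $\lambda_n$, by Proposition 5.2, (4.4)
and (11.6), we have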



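$$\lambda_n \ge \frac{128 c_0\sigma K^2(s, s, 3, \Sigma)\lambda_{\mathrm{init}}\sqrt{s}}{\eta}\sqrt{\frac{2\log(p - s)}{n}}
= \frac{4c_0\sigma\left(32K^2(s, s, 3, \Sigma)\lambda_{\mathrm{init}}\sqrt{s}\right)}{\eta}\sqrt{\frac{2\log(p - s)}{n}}
= \frac{4c_0\sigma\delta_{S^c}}{\eta}\sqrt{\frac{2\log(p - s)}{n}}$$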



and 



16MV2K{s, s, 3, T,)K{s, s, 3, X)coaXi^n,^\S\ j2log{p - s) 

An < 



K{s,s,3,X) V n 



< Mcoa32K^{s,3,3,J:)XinitV^\^^^^ — - 

V n 



V n 

where we used the fact that $K(s, 3, X) \le K(s, s, 3, X)$. Hence (2.12) is satisfied. In addition, for
$K = \sqrt{2}K(s, s, 3, \Sigma)$, the sparsity condition (2.13) holds by Corollary 11.2. Condition (4.7) implies that
the condition (2.14) for $\beta_{\min}$ holds, given (11.5) and (11.6) and Proposition 5.1. We can then invoke
Theorem 2.1 to finish the proof with $c_0 = \sqrt{3/2}$. □

12 Proof of the sign recovery Lemma



12.1 Preliminaries 

We first state necessary and sufficient conditions for the event $\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^*)$. Note that this is essentially
equivalent to Lemmas 2 and 3 in Wainwright (2008). First, for $\hat\Sigma = X^T X/n$, let $\hat\Sigma_{RT} = X_R^T X_T/n$ be the
submatrix of $\hat\Sigma$ with rows and columns indexed by R and T respectively.

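Lemma 12.1. Let $b := (\mathrm{sgn}(\beta^*_j) w_j)_{j \in S}$. Let $w = (w_1, w_2, \dots, w_p)$, where $w_j > 0$, $\forall j$, be a positive
weight vector. Assume that the matrix $X_S^T X_S$ is invertible. Then for any given $\lambda_n > 0$ and noise vector
$\epsilon \in \mathbb{R}^n$, there exists a solution $\hat\beta$ for the weighted Lasso such that

$$\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^*),$$

if and only if the following two conditions hold:

$$\left|\hat\Sigma_{S^c S}(\hat\Sigma_{SS})^{-1}\left[\frac{X_S^T\epsilon}{n} - \lambda_n b\right] - \frac{X_{S^c}^T\epsilon}{n}\right| \le \lambda_n w_{S^c}, \qquad (12.1a)$$
$$\mathrm{sgn}\left(\beta^*_S + (\hat\Sigma_{SS})^{-1}\left[\frac{X_S^T\epsilon}{n} - \lambda_n b\right]\right) = \mathrm{sgn}(\beta^*_S), \qquad (12.1b)$$

where the inequality in (12.1a) is understood componentwise.
Finally, if (12.1a) holds with strict inequality, then the solution of the weighted Lasso is unique.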
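
Conditions (12.1a) and (12.1b) can be evaluated directly for given data. The sketch below does so under the assumption that $\hat\Sigma_{RT} = X_R^T X_T/n$ and that $w_{S^c}$ collects the weights on $S^c$; the function name and the toy orthogonal design are hypothetical and serve only to illustrate the two checks.

```python
import numpy as np

def sign_recovery_conditions(X, beta_star, eps, lam, w):
    """Evaluate conditions (12.1a) and (12.1b) for the weighted Lasso."""
    X = np.asarray(X, dtype=float)
    beta_star = np.asarray(beta_star, dtype=float)
    w = np.asarray(w, dtype=float)
    n, p = X.shape
    S = np.flatnonzero(beta_star)
    Sc = np.setdiff1d(np.arange(p), S)
    XS, XSc = X[:, S], X[:, Sc]
    b = np.sign(beta_star[S]) * w[S]
    # u = (Sigma_hat_SS)^{-1} [X_S^T eps / n - lam * b]
    u = np.linalg.solve(XS.T @ XS / n, XS.T @ eps / n - lam * b)
    lhs_a = np.abs((XSc.T @ XS / n) @ u - XSc.T @ eps / n)
    cond_a = bool(np.all(lhs_a <= lam * w[Sc]))
    cond_b = bool(np.array_equal(np.sign(beta_star[S] + u), np.sign(beta_star[S])))
    return cond_a, cond_b

# Toy noiseless example with orthogonal columns, so that X^T X / n is the identity.
X = 2.0 * np.eye(4)[:, :3]
beta_star = np.array([1.0, -1.0, 0.0])
eps = np.zeros(4)
small_lam = sign_recovery_conditions(X, beta_star, eps, lam=0.1, w=np.ones(3))
large_lam = sign_recovery_conditions(X, beta_star, eps, lam=2.0, w=np.ones(3))
```

With the small penalty both conditions hold; with the large penalty the shrinkage $\lambda_n b$ exceeds the signal and (12.1b) fails, while (12.1a) still holds.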



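Proof. Recall that we observe $Y = X\beta^* + \epsilon$ and $b := (\mathrm{sgn}(\beta^*_i) w_i)_{i \in S}$. Let $w = (w_1, w_2, \dots, w_p)$ be the
weight vector.

First observe that the KKT conditions imply that $\hat\beta \in \mathbb{R}^p$ is a solution, if and only if there exists a subgradient

$$\hat g \in \partial\sum_{j=1}^{p} w_j|\hat\beta_j| = \left\{z \in \mathbb{R}^p \mid z_i = \mathrm{sgn}(\hat\beta_i) w_i \text{ for } \hat\beta_i \ne 0, \text{ and } |z_j| \le w_j \text{ otherwise}\right\}$$

such that

$$\frac{1}{n}X^T X\hat\beta - \frac{1}{n}X^T Y + \lambda_n\hat g = 0, \qquad (12.2)$$

which is equivalent to the following linear system by substituting $Y = X\beta^* + \epsilon$ and re-arranging:

$$\frac{1}{n}X^T X(\hat\beta - \beta^*) - \frac{1}{n}X^T\epsilon + \lambda_n\hat g = 0. \qquad (12.3)$$

Hence, given $X, \beta^*, \epsilon$ and $\lambda_n > 0$, the event $\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^*)$ holds if and only if

1. there exist a point $\hat\beta \in \mathbb{R}^p$ and a subgradient $\hat g \in \partial\sum_{j=1}^{p} w_j|\hat\beta_j|$ such that (12.3) holds, and

2. $\mathrm{sgn}(\hat\beta_S) = \mathrm{sgn}(\beta^*_S)$ and $\hat\beta_{S^c} = \beta^*_{S^c} = 0$, which implies that $\hat g_S = b$ and $|\hat g_j| \le w_j$, $\forall j \in S^c$, by
definition of $\hat g$.

Plugging $\hat\beta_{S^c} = \beta^*_{S^c} = 0$ and $\hat g_S = b$ in (12.3) shows that $\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^*)$ if and only if

1. there exists a point $\hat\beta \in \mathbb{R}^p$ and a subgradient $\hat g \in \partial\sum_{j=1}^{p} w_j|\hat\beta_j|$ such that

$$\frac{1}{n}X_{S^c}^T X_S(\hat\beta_S - \beta^*_S) - \frac{1}{n}X_{S^c}^T\epsilon = -\lambda_n\hat g_{S^c}, \qquad (12.4a)$$
$$\frac{1}{n}X_S^T X_S(\hat\beta_S - \beta^*_S) - \frac{1}{n}X_S^T\epsilon = -\lambda_n b, \qquad (12.4b)$$

2. and $\mathrm{sgn}(\hat\beta_S) = \mathrm{sgn}(\beta^*_S)$ and $\hat\beta_{S^c} = \beta^*_{S^c} = 0$.

Using invertibility of $X_S^T X_S$, we can solve for $\hat\beta_S$ and $\hat g_{S^c}$ using (12.4a) and (12.4b) to obtain

$$-\lambda_n\hat g_{S^c} = X_{S^c}^T X_S(X_S^T X_S)^{-1}\left[\frac{1}{n}X_S^T\epsilon - \lambda_n b\right] - \frac{1}{n}X_{S^c}^T\epsilon,$$
$$\hat\beta_S = \beta^*_S + \left(\frac{1}{n}X_S^T X_S\right)^{-1}\left[\frac{1}{n}X_S^T\epsilon - \lambda_n b\right].$$

Thus, given invertibility of $X_S^T X_S$, $\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^*)$ holds if and only if

1. there exists simultaneously a point $\hat\beta \in \mathbb{R}^p$ and a subgradient $\hat g \in \partial\sum_{j=1}^{p} w_j|\hat\beta_j|$ such that

$$-\lambda_n\hat g_{S^c} = \hat\Sigma_{S^c S}(\hat\Sigma_{SS})^{-1}\left[\frac{1}{n}X_S^T\epsilon - \lambda_n b\right] - \frac{1}{n}X_{S^c}^T\epsilon, \qquad (12.5a)$$
$$\hat\beta_S = \beta^*_S + (\hat\Sigma_{SS})^{-1}\left[\frac{1}{n}X_S^T\epsilon - \lambda_n b\right], \qquad (12.5b)$$

2. and $\mathrm{sgn}(\hat\beta_S) = \mathrm{sgn}(\beta^*_S)$ and $\hat\beta_{S^c} = \beta^*_{S^c} = 0$.

Thus, for $\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^*)$ to hold, there exists simultaneously a point $\hat\beta \in \mathbb{R}^p$ and a subgradient
$\hat g \in \partial\sum_{j=1}^{p} w_j|\hat\beta_j|$ such that

$$\left|\hat\Sigma_{S^c S}(\hat\Sigma_{SS})^{-1}\left[\frac{1}{n}X_S^T\epsilon - \lambda_n b\right] - \frac{1}{n}X_{S^c}^T\epsilon\right| = \left|-\lambda_n\hat g_{S^c}\right| \le \lambda_n w_{S^c},$$
$$\mathrm{sgn}(\hat\beta_S) = \mathrm{sgn}\left(\beta^*_S + (\hat\Sigma_{SS})^{-1}\left[\frac{1}{n}X_S^T\epsilon - \lambda_n b\right]\right) = \mathrm{sgn}(\beta^*_S),$$

given that $|\hat g_{S^c}| \le w_{S^c}$ by definition of $\hat g$. Thus (12.1a) and (12.1b) hold for the given $X, \beta^*, \epsilon$ and $\lambda_n > 0$.
Thus we have shown the lemma in one direction.

For the reverse direction, given $X, \beta^*, \epsilon$, and supposing that (12.1a) and (12.1b) hold for some $\lambda_n > 0$, we
first construct a point $\hat\beta \in \mathbb{R}^p$ by letting $\hat\beta_{S^c} = \beta^*_{S^c} = 0$ and

$$\hat\beta_S = \beta^*_S + (\hat\Sigma_{SS})^{-1}\left[\frac{1}{n}X_S^T\epsilon - \lambda_n b\right],$$

which guarantees that

$$\mathrm{sgn}(\hat\beta_S) = \mathrm{sgn}\left(\beta^*_S + (\hat\Sigma_{SS})^{-1}\left[\frac{1}{n}X_S^T\epsilon - \lambda_n b\right]\right) = \mathrm{sgn}(\beta^*_S)$$

by (12.1b). We simultaneously construct $\hat g$ by letting $\hat g_S = b$ and

$$\hat g_{S^c} = -\frac{1}{\lambda_n}\left(\hat\Sigma_{S^c S}(\hat\Sigma_{SS})^{-1}\left[\frac{1}{n}X_S^T\epsilon - \lambda_n b\right] - \frac{1}{n}X_{S^c}^T\epsilon\right), \qquad (12.6)$$

which guarantees that $|\hat g_j| \le w_j$, $\forall j \in S^c$, due to (12.1a); hence $\hat g \in \partial\sum_{j=1}^{p} w_j|\hat\beta_j|$. Thus, we have
found a point $\hat\beta \in \mathbb{R}^p$ and a subgradient $\hat g \in \partial\sum_{j=1}^{p} w_j|\hat\beta_j|$ such that $\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^*)$ and the set of
equations (12.5a) and (12.5b) is satisfied. Hence, by invertibility of $X_S^T X_S$, $\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^*)$ for the given
$X, \beta^*, \epsilon, \lambda_n$. □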

12.2 Uniqueness of solution 

Finally, the uniqueness proof follows a similar argument in the revised draft of Wainwright (2008). We omit
the details. In fact, it is illustrative to rewrite the adaptive (or weighted) Lasso program as follows: Let
$W = \mathrm{diag}(w_1, \dots, w_p)$, for $w_j > 0$, and let the solution to (2.2) be

$$\hat\beta = W^{-1}\hat\beta_0, \quad \text{where}$$
$$\hat\beta_0 := \arg\min_{\beta_0} \frac{1}{2n}\|Y - XW^{-1}\beta_0\|_2^2 + \lambda_n\|\beta_0\|_1. \qquad (12.7)$$

Now we can just take $XW^{-1}$ as the design matrix and $\beta_0 := W\beta$ as the sparse vector that we recover
through $\hat\beta_0$, by solving the standard Lasso problem as in (12.7). It is clear that uniqueness of $\hat\beta_0$ to (12.7) is
equivalent to uniqueness of $\hat\beta$ as $W$ is a positive-definite matrix.
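
The reparametrization in (12.7) can be checked numerically: for any $\beta$, the weighted Lasso objective at $\beta$ with design X equals the standard Lasso objective at $\beta_0 = W\beta$ with design $XW^{-1}$. The sketch below verifies this identity on simulated data; the function names and the random inputs are illustrative assumptions, and no optimization is performed.

```python
import numpy as np

def weighted_lasso_objective(beta, X, Y, lam, w):
    """(1/2n) ||Y - X beta||_2^2 + lam * sum_j w_j |beta_j|."""
    n = X.shape[0]
    return np.sum((Y - X @ beta) ** 2) / (2 * n) + lam * np.sum(w * np.abs(beta))

def lasso_objective(beta0, X, Y, lam):
    """(1/2n) ||Y - X beta0||_2^2 + lam * ||beta0||_1."""
    n = X.shape[0]
    return np.sum((Y - X @ beta0) ** 2) / (2 * n) + lam * np.sum(np.abs(beta0))

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 5))
Y = rng.standard_normal(30)
w = rng.uniform(0.5, 3.0, size=5)       # positive weights
beta = rng.standard_normal(5)

X_tilde = X / w                         # X W^{-1}: column j divided by w_j
beta0 = w * beta                        # W beta
obj_weighted = weighted_lasso_objective(beta, X, Y, lam=0.2, w=w)
obj_standard = lasso_objective(beta0, X_tilde, Y, lam=0.2)
```

Because the map $\beta \mapsto W\beta$ is a bijection when all $w_j > 0$, minimizers of the two problems correspond one-to-one, which is the uniqueness statement above.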

12.3 Proof of Lemma 8.2 

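Let $e_i \in \mathbb{R}^s$ be the vector with 1 in the $i$-th position and zero elsewhere; hence $\|e_i\|_2 = 1$.
We first define a set of random variables that are relevant for (12.1a) and (12.1b):

$$\forall j \in S^c, \quad V_j := X_j^T X_S(X_S^T X_S)^{-1}\lambda_n b + X_j^T\left\{I_{n \times n} - X_S(X_S^T X_S)^{-1}X_S^T\right\}\frac{\epsilon}{n},$$
$$\forall i \in S, \quad U_i := e_i^T\left(\frac{X_S^T X_S}{n}\right)^{-1}\left[\frac{X_S^T\epsilon}{n} - \lambda_n b\right].$$

Condition (12.1a) holds if and only if the event

$$\mathcal{E}(V) := \left\{\forall j \in S^c, \ |V_j| \le \lambda_n w_j\right\}$$

is true. For Condition (12.1b), the event

$$\mathcal{E}(U) := \left\{\max_{i \in S}|U_i| < \beta_{\min}\right\}$$

is sufficient to guarantee that Condition (12.1b) holds.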



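We first prove that $P(\mathcal{E}(V))$ and $P(\mathcal{E}(U))$ both are large.

Analysis of $\mathcal{E}(V)$. Note that

$$\mu_j = E(V_j) = \lambda_n X_j^T X_S(X_S^T X_S)^{-1} b, \quad j \in S^c.$$

By (8.4a), we have $\forall j \in S^c$,

$$|\mu_j| \le \lambda_n w_j(1 - \eta).$$

Denote by $P = X_S(X_S^T X_S)^{-1}X_S^T = P^2$ the projection matrix. Let

$$\tilde V_j = X_j^T\left[I_{n \times n} - X_S(X_S^T X_S)^{-1}X_S^T\right]\frac{\epsilon}{n},$$

which is a zero-mean Gaussian random variable with variance

$$\mathrm{Var}(\tilde V_j) = \frac{\sigma^2}{n^2}X_j^T\left[(I_{n \times n} - P)\right]\left[(I_{n \times n} - P)\right]^T X_j \le \frac{\sigma^2}{n^2}\|X_j\|_2^2 \le \frac{c_0^2\sigma^2}{n},$$

since $\|I - P\|_2 \le 1$. Using the tail bound for a Gaussian random variable

$$P\left(|\tilde V_j| \ge t\right) \le \frac{\sqrt{\mathrm{Var}(\tilde V_j)}}{t}\exp\left(\frac{-t^2}{2\mathrm{Var}(\tilde V_j)}\right) \le \frac{c_0\sigma}{\sqrt{n}\,t}\exp\left(\frac{-nt^2}{2c_0^2\sigma^2}\right)$$

with $t = \frac{\eta\lambda_n w_{\min}(S^c)}{2} \ge 2c_0\sigma\sqrt{\frac{2\log(p - s)}{n}}$, and the union bound, we have

$$P\left(\max_{j \in S^c}|\tilde V_j| \ge t\right) \le \frac{(p - s)\exp(-4\log(p - s))}{2\sqrt{2\log(p - s)}} = \frac{1}{2(p - s)^3\sqrt{2\log(p - s)}}.$$

Thus, with probability at least $1 - \frac{1}{2(p - s)^3\sqrt{2\log(p - s)}}$,

$$\forall j \in S^c, \quad |V_j| \le |\mu_j| + |\tilde V_j| \le \lambda_n w_j(1 - \eta) + \frac{\eta\lambda_n w_{\min}(S^c)}{2} \le \lambda_n w_j(1 - \eta/2),$$

and $\mathcal{E}(V)$ holds; in fact, it holds with strict inequality for $\eta > 0$.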
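Analysis of $\mathcal{E}(U)$. By the triangle inequality, and on the set T,

$$\max_{i \in S}|U_i| \le \left\|\left(\frac{X_S^T X_S}{n}\right)^{-1}\right\|_\infty\left\|\frac{X_S^T\epsilon}{n}\right\|_\infty + \left\|\left(\frac{X_S^T X_S}{n}\right)^{-1}\right\|_\infty\lambda_n w_{\max}(S)
\le \frac{\sqrt{s}}{\Lambda_{\min}(s)}\left(c_0\sigma\sqrt{\frac{24\log p}{n}} + \lambda_n w_{\max}(S)\right),$$

where

$$\left\|\left(\frac{X_S^T X_S}{n}\right)^{-1}\right\|_\infty \le \sqrt{s}\left\|\left(\frac{X_S^T X_S}{n}\right)^{-1}\right\|_2 \le \frac{\sqrt{s}}{\Lambda_{\min}(X_S^T X_S/n)} \le \frac{\sqrt{s}}{\Lambda_{\min}(s)}$$

by standard matrix norm comparison results and the restricted eigenvalue assumption. Hence, $\mathcal{E}(U)$ holds
on the set T. Denote by $\mathcal{E} = \mathcal{E}(U)^c \cup \mathcal{E}(V)^c$. Then we have

$$P(\mathcal{E}) \le P(T^c) + P(\mathcal{E}(V)^c) \le 2/p^2$$

by Lemma 9.1 and the analysis of $\mathcal{E}(U)$ and $\mathcal{E}(V)$ above.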



13 Proof of Theorem 2.1 

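We note that for a fixed design X, once we have checked that the incoherence conditions and the conditions
on $\lambda_n$ and $\beta_{\min}$ as in (8.6) are satisfied, we can then invoke Lemma 8.2 to finish the proof of the theorem. For a random
design, our proof follows the case of a fixed design after we exclude the bad event $\mathcal{X}^c$ for $\mathcal{X}$ as defined
in (2.9). We now show that on $\mathcal{X} \cap T$, where for a fixed design $\mathcal{X}^c = \emptyset$, all conditions in Lemma 8.2 for
$c_0 = 3/2$ are indeed satisfied.

First by Lemma 11.1, we have $\Lambda_{\min}(X_S^T X_S/n) \ge \Lambda_{\min}(s)$ and hence (8.3b) holds under $\mathcal{X} \cap T$, given (2.8).
Now we have by $\beta_{\min} \ge 2\delta_S \ge 2\|\delta_S\|_\infty$,

$$\forall j \in S, \quad |\beta_{j,\mathrm{init}}| \ge \beta_{\min} - \|\delta_S\|_\infty \ge \frac{\beta_{\min}}{2} \quad \text{and hence} \quad w_{\max}(S) \le \max\left\{\frac{2}{\beta_{\min}}, 1\right\}.$$

It also holds by $1 \ge \delta_{S^c} \ge \|\delta_{S^c}\|_\infty$ that

$$\forall j \in S^c, \quad |\beta_{j,\mathrm{init}}| \le \|\delta_{S^c}\|_\infty \le \delta_{S^c} \le 1 \quad \text{and} \quad w_{\min}(S^c) \ge \frac{1}{\|\delta_{S^c}\|_\infty}.$$

Hence the choice of $\lambda_n$ in (2.12) guarantees that

$$\lambda_n w_{\min}(S^c) \ge \frac{\lambda_n}{\|\delta_{S^c}\|_\infty} \ge \frac{\lambda_n}{\delta_{S^c}} \ge \frac{4c_0\sigma}{\eta}\sqrt{\frac{2\log(p - s)}{n}}.$$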
We now show that the incoherence condition as in (8.5) holds given > Vn- 

1. Suppose /3niin < 2 satisfies (2.14), we have lUmax = 2//3min and hence 



— II t- II — -r ^ fn ^ '^n- K'^^-'^) 



2. Suppose Pniin > 2: then Wmax(5') = 1 and by assumption. 



^«max ¥s4oo 6sc 

It is clear that (8.6) is satisfied given (2.14), if



/5min > max 



{ 




4A„^ 2\n^/s 



} 



(13.2) 



We only need to be concerned with the first term: given the last two terms in the Pram bound, we have 



where X'^ = for a fixed design, and the last term has been bounded using Lemma 8.2 for a fixed design or 